
14 Best Chatbot Datasets for Machine Learning

July 22, 2021

To build an effective chatbot, you first need realistic, task-oriented dialogue data to train it on. Without this data, the chatbot will struggle to resolve user inquiries quickly or to answer questions without human intervention.

To help you build a dialogue system that creates natural-feeling conversations between humans and virtual agents, we at iMerit have compiled a list of the most successful and commonly used chatbot training datasets. The entries cover customer support data, multilingual data, dialogue data, and question-answer data.

Question-Answer Datasets for Chatbot Training

The WikiQA Corpus: The WikiQA Corpus was made publicly available in 2015, and has been updated several times since its inception. It contains different sets of question and sentence pairs that were originally collected 

Question-Answer Database: This chatbot dataset was designed for academic research and features Wikipedia articles alongside manually generated factoid questions based on them, together with manually generated answers to those questions.

Yahoo Language Data: This dataset is composed of manually curated QA datasets drawn from Yahoo Answers.

TREC QA Collection: TREC has run a question answering track since 1999. In each track, the task was defined so that systems had to retrieve small snippets of text containing an answer to open-domain, closed-class questions.

Customer Support Datasets for Chatbot Training

Relational Strategies in Customer Service Dataset: This dataset features human-computer data from three live customer service representatives working in the travel and telecommunications domains. It also contains data from airline forums on TripAdvisor.com.

Ubuntu Dialogue Corpus: Drawn from the Ubuntu chat logs, this dataset consists of almost one million two-person conversations (930,000 dialogues spanning 100,000,000 words), which makes it well suited to training a chatbot.

Customer Support on Twitter: This dataset consists of more than 3 million tweets pertaining to the largest brands on Twitter.

Dialogue Datasets for Chatbot Training

Santa Barbara Corpus of Spoken American English: Consisting of approximately 249,000 words, the Santa Barbara Corpus of Spoken American English includes transcripts, audio, and timestamps that align the transcription with the audio at the level of individual intonation units.

Semantic Web Interest Group IRC Chat Logs: These automatically generated IRC chat logs from the Semantic Web Interest Group are published daily, and each message carries a corresponding timestamp.

Multi-Domain Wizard-of-Oz dataset (MultiWOZ): This large-scale human-human conversational corpus contains 8,438 multi-turn dialogues, each averaging 14 turns. It differs from earlier task-oriented dialogue datasets, which typically contain fewer than 10 slots and only a few hundred values, by offering a much larger set of slots and values. It covers seven domains: restaurant, hotel, attraction, police, hospital, taxi, and train.
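
To make "slots" and "values" concrete, here is a minimal Python sketch of how a multi-domain dialogue state could be represented and updated turn by turn. The slot names (restaurant-food, taxi-destination, and so on) and the update_state and domains_in helpers are illustrative assumptions, not the dataset's actual schema or tooling.

```python
from typing import Dict, Set

# A dialogue state maps "domain-slot" keys to the values the user has given so far.
# The key names used below are hypothetical and chosen only for illustration.
DialogueState = Dict[str, str]


def update_state(state: DialogueState, turn_slots: Dict[str, str]) -> DialogueState:
    """Return a new state with the slots mentioned in this turn merged in.

    A slot mentioned again in a later turn overwrites its earlier value.
    """
    new_state = dict(state)
    new_state.update(turn_slots)
    return new_state


def domains_in(state: DialogueState) -> Set[str]:
    """Collect the domains (the part of each key before the first hyphen)."""
    return {key.split("-", 1)[0] for key in state}


# Three user turns from an imagined conversation spanning two domains.
state: DialogueState = {}
state = update_state(state, {"restaurant-food": "italian", "restaurant-area": "centre"})
state = update_state(state, {"taxi-destination": "the restaurant"})
state = update_state(state, {"restaurant-area": "north"})  # user changes their mind
```

After the three turns, the state holds three slots across the restaurant and taxi domains, with the restaurant area overwritten by the most recent value.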

The NPS Chat Corpus: Consisting of 10,567 posts selected from a collection of 500,000 posts gathered from various online chat services, the NPS Chat Corpus was created for non-commercial/non-profit educational and research use. All posts remain copyrighted by their original authors.

ConvAI2 Dataset: Collected during the ConvAI2 competition, this dataset features 2,000+ dialogues in which human evaluators recruited from crowdsourcing platforms chatted with bots.

Cornell Movie-Dialogs Corpus: Containing a large metadata-rich collection of fictional conversations extracted from raw movie scripts, this dataset features 220,000+ conversations between 10,000+ pairs of movie characters. It involves more than 9,000 characters across 617 movies, and features a total of 304,713 utterances.
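
As a rough illustration of working with a corpus like this, the Python sketch below parses utterance records into a lookup table keyed by line ID. It assumes each line of the corpus's movie_lines.txt file holds five fields (line ID, character ID, movie ID, character name, text) separated by " +++$+++ "; check the README shipped with the corpus before relying on that layout. The sample lines and the Utterance and parse_lines names are made up for the example.

```python
from typing import Dict, Iterable, NamedTuple

# Assumed field separator; verify against the corpus README.
SEPARATOR = " +++$+++ "


class Utterance(NamedTuple):
    line_id: str
    character_id: str
    movie_id: str
    character_name: str
    text: str


def parse_lines(lines: Iterable[str]) -> Dict[str, Utterance]:
    """Parse raw utterance lines into a dict keyed by line ID.

    Lines that do not split into exactly five fields are skipped.
    """
    utterances: Dict[str, Utterance] = {}
    for raw in lines:
        fields = raw.rstrip("\n").split(SEPARATOR)
        if len(fields) != 5:
            continue  # malformed record
        utt = Utterance(*(field.strip() for field in fields))
        utterances[utt.line_id] = utt
    return utterances


# Invented sample records in the assumed layout, plus one malformed line.
sample = [
    "L1 +++$+++ u0 +++$+++ m0 +++$+++ ALICE +++$+++ Where are we going?",
    "L2 +++$+++ u1 +++$+++ m0 +++$+++ BOB +++$+++ Somewhere quiet.",
    "this line is malformed",
]
utterances = parse_lines(sample)
```

Keying the table by line ID is a design choice that makes it straightforward to stitch utterances back into conversations when a separate file lists conversations as sequences of line IDs.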

Multilingual Chatbot Training Datasets

EXCITEMENT Datasets: These datasets feature negative feedback from customers where specific reasons are given about their dissatisfaction with a particular company or service. Each of these datasets is available in English and Italian.

NUS Corpus: This crowdsourced SMS corpus was collected for research by the Department of Computer Science at the National University of Singapore, and consists of 67,093 SMS messages. The messages were contributed by volunteers who consented to having them made publicly available.