
25 Best NLP Datasets for Machine Learning

July 22, 2021

When it comes to natural language processing, beginning with the right datasets is key. As the field continues to grow in both application and development, it’s important to have a strong footing and a reliable reference point for the data you’ll use to train your model.


Drawing on our experience in NLP, we at iMerit have compiled this list of the top NLP datasets we’ve used to help clients optimize their NLP projects. While it’s nearly impossible to cover every field of interest within this rapidly growing space, the sources on this list range from sentiment analysis to audio and voice recognition projects. It should serve as a solid starting point for your experiments.

NLP Datasets for Sentiment Analysis

With these large, highly specialized datasets, training a machine learning model for sentiment analysis should be a breeze (a quick training sketch follows this list).


IMDB Reviews: With over 25,000 reviews across thousands of films, this dataset (while relatively small) is the perfect dataset for binary sentiment classification use cases.

Multi-Domain Sentiment Analysis Dataset: While this dataset may be slightly older, it features a massive variety of Amazon products along with their corresponding reviews.

Stanford Sentiment Treebank: With its 10,000+ Rotten Tomatoes reviews, this dataset is perfect for training a model to identify sentiment in longer phrases.

Sentiment140: With 1.6 million tweets, this popular dataset comes formatted in six fields: polarity, tweet ID, date, query, user, and text.

Twitter US Airline Sentiment: This 2015 dataset features already-classified tweets (positive, neutral, negative) pertaining to US airlines.
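To give a sense of how quickly one of these datasets can be put to work, here is a minimal training sketch for binary sentiment classification on the IMDB reviews above. It assumes the Hugging Face datasets library and scikit-learn are installed; “imdb” is the dataset’s identifier on the Hugging Face Hub.

```python
# Minimal binary sentiment classifier on the IMDB reviews dataset.
# Assumes `pip install datasets scikit-learn` has been run.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

imdb = load_dataset("imdb")  # splits: train / test; fields: text, label

# Turn raw review text into TF-IDF bag-of-words features.
vectorizer = TfidfVectorizer(max_features=20_000, stop_words="english")
X_train = vectorizer.fit_transform(imdb["train"]["text"])
X_test = vectorizer.transform(imdb["test"]["text"])

# A simple linear classifier is a strong baseline for this task.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, imdb["train"]["label"])

preds = clf.predict(X_test)
print(f"test accuracy: {accuracy_score(imdb['test']['label'], preds):.3f}")
```

TF-IDF plus logistic regression is a deliberately simple baseline; once the data is loaded this way, you can swap in whatever model you’re actually evaluating.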


Text Datasets

The following datasets are perfect for voice recognition and chatbots, as they cover a broad range of text sources.


20 Newsgroups: A collection of 20,000 documents drawn from 20 newsgroups. The subjects are of particular interest, ranging from religion to popular sports.
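If you work in Python, scikit-learn happens to bundle a downloader for this corpus, so trying it out takes only a few lines:

```python
# Load the 20 Newsgroups corpus via scikit-learn's built-in fetcher.
from sklearn.datasets import fetch_20newsgroups

# Stripping headers, footers, and quotes avoids leaking easy metadata clues.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
print(len(train.data), "documents across", len(train.target_names), "newsgroups")
print(train.target_names[:5])  # e.g. alt.atheism, comp.graphics, ...
```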

Reuters News Dataset: Originally appearing in 1987, this collection of newswire articles has been labeled, indexed, and compiled for use in machine learning.

ArXiv: This massive 270 GB dataset features the full text of all arXiv research papers.

The WikiQA Corpus: This publicly available Q&A dataset was initially compiled to aid open-domain question answering research.

UCI’s Spambase: This dataset was created by a team at HP (Hewlett-Packard) to help build a spam filter. It contains a large collection of emails previously labeled as spam by users.

Yelp Reviews: This Yelp dataset features 8.5M+ reviews of over 160,000 businesses. It also includes 200,000+ pictures and spans eight major metropolitan areas.


WordNet: This dataset was compiled by Princeton University researchers as a large lexical database of English ‘synsets’. If you don’t know what those are, fret not; they’re essentially groups of synonyms that each describe a distinct concept.
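To see what a synset looks like in practice, here’s a minimal sketch using NLTK’s WordNet interface (assuming nltk is installed; the WordNet data itself is a one-time download):

```python
# Browse WordNet synsets with NLTK.
import nltk

nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

car = wn.synsets("car")[0]  # the first sense of "car"
print(car.lemma_names())    # the synonym group for that concept
print(car.definition())     # a gloss describing the concept
```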

The Blog Authorship Corpus: Containing over 681,000 blog posts written by 19,320 bloggers, this dataset holds over 140 million words. 

Audio Speech Datasets for Natural Language Processing

Audio speech datasets like the ones featured in this list are especially useful for NLP applications such as virtual assistants, in-car navigation, and other voice-activated systems.


2000 HUB5 English: Containing transcripts originally derived from 40 English-language telephone conversations, this dataset offers a slew of speech files for NLP.

LibriSpeech: Containing roughly 1,000 hours of English speech, this dataset is essentially a collection of audiobooks that have been organized by the chapters of the books they were derived from.
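For PyTorch users, torchaudio ships a loader for this dataset. A minimal sketch, using the relatively small “test-clean” subset to keep the download manageable:

```python
# Download and inspect a LibriSpeech split via torchaudio.
# Assumes `pip install torch torchaudio`; "test-clean" is one of the
# smaller official subsets (roughly 350 MB).
import torchaudio

dataset = torchaudio.datasets.LIBRISPEECH(root=".", url="test-clean", download=True)
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dataset[0]
print(sample_rate, transcript)  # 16 kHz audio paired with its transcript
```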

Spoken Wikipedia Corpora: The perfect dataset for anyone looking to move beyond English, this corpus is composed of articles spoken in German, Dutch, and English, read aloud by a wide variety of readers across various subjects.

Free Spoken Digit Dataset: This NLP dataset is composed of 1,500+ recordings of spoken digits in English. 


TIMIT: Designed for the development of automatic speech recognition systems, this dataset features recordings of 630 American English speakers, each reading ten ‘phonetically rich’ sentences. It’s especially useful for research in acoustic-phonetic studies.

NLP Datasets (General)

In case the above hasn’t delivered what you need, here are a few more datasets covering a gamut of different subjects and use cases.


Enron Dataset: This dataset contains 500,000+ email messages from Enron executives and is especially useful for anyone looking to understand the inner workings of email tools.

Google Books Ngrams: This dataset contains n-grams extracted from the Google Books corpus. An n-gram is a fixed-size tuple of items; the n specifies the number of elements in the tuple, so a 5-gram contains five words or characters.
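In code, an n-gram is nothing more than a sliding window of n consecutive tokens; a minimal sketch:

```python
# Extract all n-grams (runs of n consecutive tokens) from a token list.
def ngrams(tokens, n):
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i : i + n])

words = "the quick brown fox jumps".split()
print(list(ngrams(words, 2)))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(list(ngrams(words, 5)))  # one 5-gram covering all five words
```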

Amazon Reviews: Featuring 35 million reviews of Amazon products spanning 18 years, this dataset is especially useful for anyone needing user information, ratings, and plaintext reviews for sentiment analysis.

Wikipedia Links Data: This Google dataset contains approximately 13 million documents, each containing at least one hyperlink to an English Wikipedia page. Each Wikipedia page is treated as an entity.

Blogger Corpus: This Blogger.com collection of approximately 681,288 blog posts contains over 140 million words. Every blog in the collection includes at least 200 occurrences of commonly used English words.


Gutenberg eBooks List: An annotated list of eBooks from Project Gutenberg, this NLP dataset features basic information about each eBook, organized by year of publication.

Jeopardy: This dataset contains 200,000+ questions and answers from the quiz show Jeopardy!, compiled by a saintly Reddit user; each data point includes additional fields such as air date, category, and show number.

Hansards Text Chunks of Canadian Parliament: Containing 1.3 million aligned pairs of English and French text chunks drawn from the official records of the 36th Canadian Parliament, this dataset is especially useful for machine translation and other NLP applications.

SMS Spam Collection in English: Perfect for building a spam filter, this NLP dataset contains 5,500+ SMS messages in English, each tagged as either legitimate or spam.
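To close, here’s a bare-bones sketch of the kind of spam filter this dataset supports: word counts fed to a naive Bayes classifier. The SMSSpamCollection file name and its tab-separated label/message layout are assumptions about the downloaded file.

```python
# A bare-bones SMS spam filter: bag-of-words counts into naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

labels, messages = [], []
# Assumed layout: one "label<TAB>message" pair per line.
with open("SMSSpamCollection", encoding="utf-8") as f:
    for line in f:
        label, message = line.rstrip("\n").split("\t", 1)
        labels.append(label)  # "ham" (legitimate) or "spam"
        messages.append(message)

X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.2, random_state=0
)
vectorizer = CountVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(X_train), y_train)
print("held-out accuracy:", clf.score(vectorizer.transform(X_test), y_test))
```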