Top 25 Twitter Datasets for Natural Language Processing and Machine Learning

July 15, 2021

Social media data is a valuable asset for anyone training ML algorithms. While aggregating this data can be troublesome, teams from educational organizations and research groups have done the work of creating open datasets for public use.

In this article, we’ll list the top 25 Twitter datasets that can be used to train models for tasks such as sentiment analysis and content moderation.

1. COVID-19 Tweets

This Twitter dataset contains 150+ million tweets related to the COVID-19 global pandemic. The dataset spans a wide range of languages, with English, Spanish, and French being the most prevalent.

2. Avengers Endgame Tweets

Over 10,000 records revolving around #AvengersEndgame, the hit film from 2019.

3. UMass Global English on Twitter Dataset

Containing over 10,000 tweets, this dataset is randomly sampled from geotagged Twitter messages. Each tweet has been annotated based on its language.

4. Twitter User Data

This Twitter dataset contains 20,000 rows, each featuring a username, a random tweet, the account profile, and image and location information.

5. Apple Twitter Sentiment

This Twitter dataset focuses on tweets revolving around Apple. Each tweet contains the #AAPL hashtag, along with the reference @apple. Contributors were asked to classify each tweet as either Positive, Negative, or Neutral.

6. Credibility Corpus in French and English

Composed of French and English tweets around rumors, this Twitter dataset was created to aid in the detection of misinformation.

7. Every Donald Trump Tweet

While this data has been moved since it was originally posted, it can now be accessed from thetrumparchive.com.

8. Pre-processed Twitter Tweets

This collection of pre-processed tweets has been divided into positive, negative, and neutral categories for sentiment analysis.

9. Customer Support on Twitter

This substantial dataset features numerous customer support accounts on Twitter, along with their corresponding tweets and replies.

10. Charlottesville on Twitter

This Twitter dataset focuses on 150,000 tweets surrounding the Unite the Right rally that took place in Charlottesville. 

11. Game of Thrones Season 8 Tweets

This compilation of tweets focuses on the Twitter feedback that took place after each episode of the popular television series Game of Thrones.

12. Sentiment 140

Sentiment 140 is a company that has made its training data available to the public on its site. Its tool is typically used to analyze sentiment around specific topics, brands, or products that are discussed on Twitter.

13. SMILE Twitter Emotion

Ideal for sentiment analysis, this Twitter dataset contains over 3,000 tweets across a range of emotions including happiness, anger, outrage, sadness, and more.

14. Twitter Friends

This Twitter dataset for machine learning contains information on avatars, friend counts, user IDs, follower counts, user language, the ID of each user’s last tweet, last post info, and more.

15. Twitter User Data

This Twitter dataset contains 20,000 rows featuring usernames, a corresponding random tweet, account profile, image, and location information.

16. Twitter News Dataset

This Twitter dataset focuses on 5,234 news events and their corresponding tweets.

17. UMass Global English on Twitter Dataset

This 10,000-tweet dataset was designed for building classifiers that identify the language of tweets. All tweets are annotated as either English or non-English across the 130 countries they were gathered from.

18. Stanford SNAP Twitter Dataset

With over 476 million tweets from 20 million users spanning a 7-month period, this Twitter dataset comes straight from the SNAP library database at Stanford University.

19. Top 20 Most-Followed Users on Twitter

These 52,000 tweets hail from the top 20 Twitter profiles. This dataset does not contain retweets.

20. Twitter Airline Sentiment

This Twitter dataset focuses on tweets relating to major US airlines, and is classified into positive, neutral, and negative sentiment.
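Before training on a three-class sentiment dataset like this one, it can help to check how balanced the labels are. The following is a minimal sketch, not the dataset's official loader: the column names `text` and `sentiment` and the sample rows are assumptions made for illustration, so adjust them to match the file you actually download.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; a real file's columns and contents may differ.
SAMPLE_CSV = """text,sentiment
"@airline thanks for the quick rebooking!",positive
"@airline my flight was delayed three hours",negative
"@airline what time does check-in open?",neutral
"@airline lost my bag again",negative
"""


def label_distribution(csv_file, label_column="sentiment"):
    """Count how many rows carry each sentiment label."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# For a real file, pass open("your_file.csv", newline="", encoding="utf-8") instead.
counts = label_distribution(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

A heavily skewed distribution is a hint that accuracy alone may be a misleading metric for a model trained on the data.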

21. COVID-19 Tweets

Yet another Twitter dataset that pertains to COVID-19, this dataset contains almost 1.5 billion tweets across the globe.

22. 2016 Presidential Election

Originally compiled to create some transparency against accusations of state-sponsored propaganda, this dataset focuses on tweets relating to the 2016 presidential election.

23. VoterFraud 2020 dataset

This Twitter dataset focuses on rumors circulating around voter fraud during the 2020 presidential election, and contains 7.6M tweets along with almost 26M retweets from 2.5M+ unique Twitter users.

24. 16 Million Unfiltered Tweets

These 16 million tweets were compiled between January 23rd and February 8th of 2011. The unfiltered nature of the tweets means that users will find both important tweets and spam tweets side by side.
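Because spam often shows up as the same message posted many times with different links, one rough way to triage an unfiltered collection is to flag near-duplicate text. The sketch below is only an illustrative heuristic under that assumption, not a method supplied with the dataset; the function names, the `max_copies` threshold, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+")


def normalize(text):
    """Lowercase, replace URLs with a placeholder, and collapse whitespace."""
    text = URL_PATTERN.sub("<url>", text.lower())
    return " ".join(text.split())


def flag_likely_spam(tweets, max_copies=2):
    """Return tweets whose normalized text appears more than max_copies times."""
    seen = Counter(normalize(t) for t in tweets)
    return [t for t in tweets if seen[normalize(t)] > max_copies]


# Hypothetical sample tweets for illustration only.
tweets = [
    "Win a free phone http://a.example/1",
    "win a FREE phone http://b.example/2",
    "Win a free phone  http://c.example/3",
    "Snow in Boston this morning",
]
flagged = flag_likely_spam(tweets)
print(len(flagged), "of", len(tweets), "tweets flagged")
```

A heuristic like this will miss spam that varies its wording and may flag popular legitimate phrases, so treat the flagged tweets as candidates for review rather than as a final label.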

25. Harvard Dataverse

This Twitter dataset focuses again on the 2016 presidential election, and contains 280M tweets gathered between July 13, 2016, and November 10, 2016.