Social media data is a top asset for anyone training ML algorithms. While aggregating this data can be troublesome, teams of professionals from educational organizations and research groups have done the work to create open datasets for public use.
In this article, we’ll list the top 25 Twitter datasets that can be used for models across sentiment analysis and content moderation.
This Twitter dataset contains 150+ million tweets related to the COVID-19 global pandemic. The dataset spans just about every language, with English, Spanish, and French being the most prevalent.
This dataset contains over 10,000 records revolving around #AvengersEndgame, the hit film from 2019.
Containing more than 10,000 tweets, this dataset is randomly sampled from geotagged Twitter messages. Each tweet has been annotated based on its language.
This Twitter dataset contains 20,000 rows, with each row featuring a username, a random tweet, account profile, and image/location information.
This Twitter dataset focuses on tweets revolving around Apple. Each tweet contains the #AAPL hashtag, along with the reference @apple. Contributors were asked to classify each tweet as either Positive, Negative, or Neutral.
Composed of French and English tweets around rumors, this Twitter dataset was created to aid in the detection of misinformation.
Although this data has been moved since it was originally posted, it can now be accessed from thetrumparchive.com.
This collection of pre-processed tweets has been divided into positive, negative, and neutral categories for sentiment analysis.
This substantial dataset features a litany of customer service support lines on Twitter, along with their corresponding tweets and replies.
This Twitter dataset focuses on 150,000 tweets surrounding the Unite the Right rally that took place in Charlottesville.
This compilation of tweets focuses on the Twitter feedback that took place after each episode of the popular television series Game of Thrones.
12. Sentiment 140
Sentiment 140 is a company that has made its training data available to the public on its site. It is a tool that’s typically used for analyzing sentiment around specific topics, brands, or products that are talked about on Twitter.
Ideal for sentiment analysis, this Twitter dataset contains over 3,000 tweets across a range of emotions including happiness, anger, outrage, sadness, and more.
14. Twitter Friends
This Twitter dataset for machine learning contains information about avatars, friend counts, user IDs, follower counts, user language, the ID of each user’s last tweet, last post info, and more.
This Twitter dataset contains 20,000 rows featuring usernames, a corresponding random tweet, account profile, image, and location information.
This Twitter dataset focuses on 5,234 news events and their corresponding tweets.
This 10,000 tweet dataset was designed for creating classifiers that help in the identification of tweets. All tweets are annotated as either English or non-English across the 130 countries they were gathered from.
With over 476 million tweets from 20 million users spanning a 7-month period, this Twitter dataset comes straight from the SNAP library database at Stanford University.
These 52,000 tweets hail from the top 20 Twitter profiles. This dataset does not contain retweets.
This Twitter dataset focuses on tweets relating to major US airlines, and is classified into positive, neutral, and negative sentiment.
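Before training on a three-class sentiment dataset like this one, it can help to check how the labels are balanced. The snippet below is a minimal sketch of that check, not the dataset's official loader: the column names `sentiment` and `text`, the helper `label_distribution`, and the sample rows are all made up for illustration, so swap in the real file and its actual column names.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for a real export. The column names
# "sentiment" and "text" are assumptions, not the dataset's documented schema.
SAMPLE_CSV = """sentiment,text
positive,Thanks for the smooth flight!
negative,My bag was lost again.
neutral,What time does boarding start?
negative,Two hour delay and no updates.
"""


def label_distribution(csv_file, label_column="sentiment"):
    """Count how many rows carry each label in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column].strip().lower() for row in reader)


# For a real file, pass open("path/to/file.csv", newline="", encoding="utf-8").
counts = label_distribution(io.StringIO(SAMPLE_CSV))
total = sum(counts.values())

# Print each label with its count and share of the total.
for label, n in counts.most_common():
    print(f"{label}: {n} ({n / total:.0%})")
```

A heavily skewed distribution is a signal to consider resampling or class weights before fitting a model.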
21. COVID-19 Tweets
Yet another Twitter dataset that pertains to COVID-19, this dataset contains almost 1.5 billion tweets from across the globe.
Originally compiled to bring some transparency to accusations of state-sponsored propaganda, this dataset focuses on tweets relating to the 2016 presidential election.
This Twitter dataset focuses on rumors circulating around voter fraud during the 2020 presidential election, and contains 7.6M tweets along with almost 26M retweets from 2.5M+ unique Twitter users.
These 16 million tweets were compiled between January 23rd and February 8th of 2011. The unfiltered nature of the tweets means that users will find both important tweets and spam tweets side by side.
This Twitter dataset focuses again on the 2016 presidential election, and contains 280M tweets gathered between July 13, 2016, and November 10, 2016.