Social media is among the most diverse and dynamic sources of data on human behaviour. Social scientists and business analysts around the world turn to it when they want to understand how people interact with the world around them. That is what makes social media data so useful for machine learning models: it is a written record of human interaction, complete with sentiment about popular subjects and engagement signals.
As of 2019, the most popular English-language social media sites are Twitter, Facebook, and Reddit. By training machine learning models on datasets from these platforms, businesses can try to understand and predict the general public’s reaction to a given product, event, or design.
Social Media Dataset Finders: Search and Download
Web scraping tools can search and download data from social media channels like Facebook, Twitter, LinkedIn, and Instagram. Before downloading any of these datasets or using them in an ML model, read the terms of service of the respective social media provider to make sure you’re not violating any of their policies.
Here are some fantastic social media dataset finders you can use:
Social Computing Data Repository: If you’re looking for a broad and varied collection of social media content, start with ASU’s Social Computing repository. Here you will find information and datasets from Twitter, Reddit, YouTube, and other popular social media sites in various formats, categories, and sizes.
Stanford Large Network Dataset Collection (SNAP): Looking for more social media datasets? Stanford has you covered. This repository is similar to ASU’s Social Computing Data Repository in that it features datasets predominantly from Twitter and Reddit. Dataset sizes vary widely, so check each one against your needs before downloading.
Network Repository: This collection includes social network datasets alongside brain networks, web graphs, and more. If you’re looking for a more diverse set of information than the previous two entries offer, this repository is a good place to look.
476 Million Twitter Tweets: Collected between June 1st and December 31st of 2009, this dataset is estimated to cover up to 30% of all public tweets from that seven-month period, which makes it large enough for most ML applications.
Sentiment140: All emoticons have been stripped from the tweets in this dataset. Its 1.6 million tweets are well suited to sentiment analysis for brand management.
Customer Support on Twitter: Kaggle’s dataset of over 3 million tweets and replies features some of the biggest brands on Twitter. Great for sentiment analysis and brand tracking.
Cheng-Caverlee-Lee September 2009~January 2010 Twitter Scrape: This social media dataset was collected for the purpose of studying Twitter geolocation data.
1.7 Billion Reddit Comments: This social media dataset features 1.7 billion JSON objects, each holding a comment together with its author, score, subreddit, and position in the comment tree. The remaining fields follow Reddit’s API, which is the place to look if you need more than these.
May 2015 Reddit Comments: This social media dataset features all Reddit comments from May 2015 and is suited to use in NLP scripts.
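The Reddit comment datasets above store each comment as a JSON object with an author, score, subreddit, and a pointer to its parent in the comment tree. The following is a minimal sketch of how such records might be read and summarized. It assumes one JSON object per line and the field names `id`, `parent_id`, `author`, `subreddit`, `score`, and `body`; the sample records are invented for illustration, so check the actual files before relying on this layout.

```python
import json
from collections import Counter, defaultdict

# Hypothetical sample: one JSON object per line, plus one malformed line.
# Field names are assumptions modelled on Reddit's API, not taken from the dataset docs.
SAMPLE_LINES = [
    '{"id": "c1", "parent_id": "t3_post1", "author": "alice", "subreddit": "python", "score": 12, "body": "Great write-up."}',
    '{"id": "c2", "parent_id": "t1_c1", "author": "bob", "subreddit": "python", "score": 3, "body": "Agreed."}',
    'this line is not valid JSON',
    '{"id": "c3", "parent_id": "t3_post2", "author": "carol", "subreddit": "datasets", "score": 7, "body": "Where can I download it?"}',
]


def parse_comments(lines):
    """Yield one dict per non-empty line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def summarize(comments):
    """Count comments and sum scores per subreddit, and group comment ids by parent."""
    counts = Counter()
    score_totals = Counter()
    children = defaultdict(list)
    for comment in comments:
        subreddit = comment.get("subreddit", "unknown")
        counts[subreddit] += 1
        score_totals[subreddit] += comment.get("score", 0)
        children[comment.get("parent_id")].append(comment.get("id"))
    return counts, score_totals, dict(children)


comments = list(parse_comments(SAMPLE_LINES))
counts, score_totals, children = summarize(comments)

print(counts.most_common())
print(dict(score_totals))
print(children)
```

Because `parse_comments` is a generator, the same function could be fed an open file handle instead of a list, which keeps memory use flat when working through a multi-gigabyte dump.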
Other Social Media Datasets
YouTube-8M Segments Dataset: This extension of the YouTube-8M dataset adds human-verified segment annotations. It covers over 237K segment labels across 1000 classes, drawn from the validation set of the YouTube-8M dataset.
One Hundred Million Creative Commons Flickr Images for Research: With over 99 million images and 700K videos from Flickr, this dataset is considered one of the largest of its kind ever released. It’s especially useful for computer vision projects. Many of the images are already geotagged, which means users can explore image content and geographic features together.
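Since many of the Flickr images are geotagged, a common first step is to select the media that fall inside a region of interest. Here is a minimal sketch of a bounding-box filter. The `MediaRecord` class, its field names, and the sample coordinates are all hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MediaRecord:
    """Hypothetical record: an identifier plus optional geotag coordinates."""
    media_id: str
    latitude: Optional[float]
    longitude: Optional[float]


def in_bounding_box(
    records: Iterable[MediaRecord],
    south: float,
    west: float,
    north: float,
    east: float,
) -> List[MediaRecord]:
    """Return the geotagged records whose coordinates fall inside the box.

    Records without a geotag are skipped. Boxes that cross the
    antimeridian (west > east) are not handled in this sketch.
    """
    selected = []
    for record in records:
        if record.latitude is None or record.longitude is None:
            continue
        if south <= record.latitude <= north and west <= record.longitude <= east:
            selected.append(record)
    return selected


# Invented sample records for illustration only.
records = [
    MediaRecord("a", 37.77, -122.42),  # inside the example box
    MediaRecord("b", 40.71, -74.00),   # outside the example box
    MediaRecord("c", None, None),      # no geotag
]

# Example box roughly around San Francisco (coordinates chosen for illustration).
nearby = in_bounding_box(records, south=37.6, west=-123.0, north=37.9, east=-122.3)
print([record.media_id for record in nearby])
```

A plain latitude/longitude comparison is enough for coarse regional filtering; finer work such as radius searches would call for a proper distance calculation.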