Top 11 Reddit Datasets for Machine Learning

July 23, 2021

Reddit is more than a social media site; it’s a brain trust of collaborators who come together to build better things. In our work with companies in the machine learning field, we at iMerit have found many datasets shared on Reddit to be tremendously useful for training machine learning models.

Thanks to Reddit’s curation methods and dedicated moderators, the “front page of the internet” has a degree of quality control that makes its content easier to trust than content from Facebook, Twitter, or Instagram. Reddit’s users are also anonymous, which lets them say and post basically anything they choose. That candor makes Reddit data well suited to testing or training natural language processing models for tasks such as content moderation or sentiment classification.

Here are our top picks of Reddit’s machine learning datasets.

Best Reddit Datasets for Machine Learning

  1. Cryptocurrency Reddit Comments Dataset: This dataset contains understandably volatile comments about a historically volatile asset, all posted between November 2017 and March 2018.
  2. Donald Trump Comments on Reddit: Another volatile dataset, featuring thousands of Reddit comments about Donald Trump.
  3. Reddit Comment Score Prediction: Originally created to help develop a model that could predict whether a Reddit comment would be upvoted or downvoted, this dataset is composed of 4 million Reddit comments. The classes are balanced: 2 million comments were downvoted and 2 million were upvoted.
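
With a balanced, labeled corpus like the Reddit Comment Score Prediction dataset, a common first step is to confirm the class balance and hold out a test set that preserves it. The following is a minimal Python sketch under stated assumptions: the column names `text` and `label`, the label values `up` and `down`, and the inline sample rows are hypothetical stand-ins, not the dataset’s actual schema.

```python
import csv
import io
import random
from collections import Counter, defaultdict

# Toy stand-in for the real file. The column names and label values
# below are assumptions made for illustration only.
SAMPLE_CSV = (
    "text,label\n"
    "Great explanation thanks,up\n"
    "This is wrong and rude,down\n"
    "Helpful link saved,up\n"
    "Nobody asked,down\n"
    "Solid answer,up\n"
    "Terrible take,down\n"
    "Learned something new,up\n"
    "Low effort comment,down\n"
)


def load_rows(handle):
    """Read comment rows from any file-like object that yields CSV text."""
    return list(csv.DictReader(handle))


def stratified_split(rows, test_fraction=0.25, seed=0):
    """Split rows into train and test sets, keeping the label ratio in both."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        cut = int(len(label_rows) * test_fraction)
        test.extend(label_rows[:cut])
        train.extend(label_rows[cut:])
    return train, test


rows = load_rows(io.StringIO(SAMPLE_CSV))
label_counts = Counter(row["label"] for row in rows)
train_rows, test_rows = stratified_split(rows, test_fraction=0.25)

print(label_counts)
print(len(train_rows), len(test_rows))
```

To try this on a real download, replace `io.StringIO(SAMPLE_CSV)` with an opened file handle and adjust the column names to match the file’s header.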

Reddit News Datasets

  1. Daily News for Stock Market Prediction: Originally created to aid in the prediction of stock market fluctuations, this dataset consists of news gathered from the subreddit r/worldnews between June 2008 and July 2016. It also features Dow Jones Industrial Average stock information.
  2. World News on Reddit: Originally taken from the subreddit r/worldnews, this dataset is composed of news posts dating all the way back to 2008. It features information about each post, including the date it was created, how many upvotes and downvotes it received, its title, its author/publisher, and whether the post contains mature content.
  3. COVID-19 News Dataset: This dataset features news articles gathered in January and February of 2020 about the earliest reports of the COVID-19 pandemic. It’s especially useful for projects that track how sentiment about the virus changed from post to post.
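
A dataset that pairs headlines with index data, like Daily News for Stock Market Prediction, is typically turned into one training example per day. The sketch below shows one way to do that in Python. The field names (`date`, `title`, `open`, `close`), the toy values, and the labeling rule (1 when the close is at or above the open) are all assumptions for illustration; check the dataset’s own documentation for its real columns and labels.

```python
from collections import defaultdict

# Hypothetical records with made-up values; not taken from the dataset.
headlines = [
    {"date": "2016-01-04", "title": "Headline A"},
    {"date": "2016-01-04", "title": "Headline B"},
    {"date": "2016-01-05", "title": "Headline C"},
]
prices = [
    {"date": "2016-01-04", "open": 100.0, "close": 101.5},
    {"date": "2016-01-05", "open": 101.5, "close": 99.0},
    {"date": "2016-01-06", "open": 99.0, "close": 100.0},  # no headlines
]


def build_daily_examples(headlines, prices):
    """Join each day's headlines into one text and attach an up/down label."""
    titles_by_date = defaultdict(list)
    for item in headlines:
        titles_by_date[item["date"]].append(item["title"])

    examples = []
    for day in prices:
        titles = titles_by_date.get(day["date"])
        if not titles:
            continue  # skip trading days that have no headlines
        examples.append(
            {
                "date": day["date"],
                "text": " ".join(titles),
                # Assumed rule: 1 if the index closed at or above its open.
                "label": 1 if day["close"] >= day["open"] else 0,
            }
        )
    return examples


examples = build_daily_examples(headlines, prices)
for example in examples:
    print(example)
```

Grouping by date first keeps the join to a single pass over each list, which matters little for toy data but helps once the inputs cover several years of headlines.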

Other Data from Reddit

  1. Reddit’s Top 1000: This dataset contains the top 1,000 Reddit posts of all time (in terms of upvotes) within 18 subreddits. The CSV files contain the title of each post along with the username of its author/publisher. The dataset also contains upvote and downvote tallies, subreddit names, URLs, and other useful metadata.
  2. Reddit Usernames: This dataset’s greatest strength is its simplicity. It consists of a CSV file of 26 million usernames, along with the total number of comments each user has made.
  3. SARC: Self-Annotated Reddit Corpus for Sarcasm: Sarcasm is the kryptonite of sentiment analysis because it can be difficult to identify. This dataset was created to combat just that, and consists of 1.3 million sarcastic posts and comments from across Reddit. The creator of this dataset labeled the sarcasm within each comment and included each post’s username, topic, and context.
  4. Science and Tech Acronyms from Reddit: Containing 140,000+ acronyms gathered from science, biology, technology, and futurology subreddits, this dataset is a CSV file that provides a detailed record of all comment IDs, times, usernames, subreddit names, and acronyms mentioned.
  5. Things on Reddit (products): A product dataset featuring the top 100 Amazon products mentioned in every subreddit that posted or featured an Amazon product between 2015 and 2017. Each CSV file in the dataset lists the product, its category, and its URL. It also includes the total mentions on Reddit for each product along with the corresponding subreddit mentions.
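
Several of the datasets above ship as large CSV files; the Reddit Usernames file, for example, has 26 million rows. Files of that size are often easier to stream row by row than to load into memory at once. Here is a minimal Python sketch of that approach, which finds the most active users. The column names `author` and `n` and the sample rows are hypothetical; replace them with whatever the real header uses.

```python
import csv
import heapq
import io

# Toy stand-in for the real file. Column names and values are assumptions.
SAMPLE_CSV = "author,n\nalice,120\nbob,45\ncarol,300\ndave,7\n"


def iter_user_counts(handle):
    """Yield (username, comment_count) pairs one row at a time."""
    for row in csv.DictReader(handle):
        yield row["author"], int(row["n"])


def most_active(handle, k):
    """Return the k users with the most comments without storing every row."""
    return heapq.nlargest(k, iter_user_counts(handle), key=lambda pair: pair[1])


top_two = most_active(io.StringIO(SAMPLE_CSV), 2)
print(top_two)
```

For a file on disk, pass a handle such as `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object. Because `heapq.nlargest` keeps only `k` candidates at a time, memory use stays small no matter how many rows the file has.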