Post

25 Open Datasets for Data Science Projects

July 23, 2021

At iMerit, we’re constantly working with some of the brightest minds throughout the world. If you’re working on a data science project and find yourself in search of datasets, then look no further than this list we’ve compiled based on the successes of our clients. In this list, you’ll find highly-curated datasets that were created for linear regression, simple classification tasks, and predictive analysis.

MNIST Datasets

The original MNIST dataset was a benchmark dataset due to its small size and brilliantly simple format. That’s why it’s still used today as a test dataset for comparing algorithmic performances. If you’re looking for a starting point, then look no further than the original MNIST dataset with its 70,000 images (60,000 of which are for training, while the remaining 10,000 is for testing). If you find yourself wanting more, then by all means give the other entries on this list a try. 

  1. EMNIST: This series of 6 datasets was originally derived from the original NIST database. It contains a set of handwritten digits that have been converted into a 28×28 pixel image format.
  2. MNIST as JPG: The MNIST as JPG dataset is essentially the MNIST dataset reformatted into JPG files.
  3. 3D MNIST: 3D point cloud version of the original MNIST dataset. 
  4. Fashion MNIST: Featuring data from Zalando, this dataset features 70,000 images from the fashion retailer Zalando’s catalogue, and has been structured into the MNIST format.
  5. Skin Cancer MNIST: HAM10000: Contains 10,000+ medical images of skin lesions.

Linear Regression Datasets for Data Science

As predictive analytics and linear regression are the most common tasks new data scientists undertake, we’ve put together the following datasets.

  1. Cancer Linear Regression: Consisting of information from cancer.gov, this dataset is composed of cancer statistics in the United States.
  2. CDC Data: Nutrition, Physical Activity, Obesity: Derived from the CDC’s Behavior Risk Factor Surveillance System, this dataset is especially useful when studying how socioeconomic factors contribute to things like physical activity, poor nutrition, and obesity
  3. Medical Insurance Costs: Consisting of 1,338 rows of data focusing on health insurance costs and patient information, this dataset was derived from the “Machine Learning with R” book by Brett Lantz.
  4. OLS Regression Challenge: This challenge was originally created to aid in the development of predicting cancer mortality rates in the US. It contains information around death rates, reported cases, US county name, income per county, population, and demographics.
  5. Real Estate Price Prediction: This is a perfect dataset for projects revolving around predictive analysis, the Real Estate Price Prediction dataset consists of information around real estate purchases including purchase data, property age, location data, housing prices within each unit area, and proximity to stations.

Stock Market Datasets

The financial sector is embracing predictive analysis with open arms, and with good reason too. Any AI model that can predict stock behaviours stands to make any investor a lot of money. These datasets for projects were generated for building just such a predictive model.

  1. Historical Stock Market Dataset: Consists of historical stock/ETF prices and their corresponding volume information.
  2. Uniqlo Stock Price Prediction: Consisting of information from Japanese clothing retailer UNIQLO, this dataset for projects contains all of the company’s historical stock information.
  3. Currency Exchange Rates: This dataset contains daily currency exchange rates around 51 currencies between 1995 and 2018.
  4. Daily Prices for All Cryptocurrencies: One of the most volatile assets of all, this dataset features daily prices for every cryptocurrency between April 28th, 2013 and November 30th, 2018.
  5. News and Stock: Designed for Machine Learning classes, this dataset is perfect for binary classification tasks due to its historical news headline data derived from Reddit’s r/worldnews subreddit. 

Image Classification Datasets for Data Science

Here’s iMerit’s top 5 datasets for projects involving computer vision and image classification.

  1. Recursion Cellular Image Classification: Derived from the 2019 Recursion challenge, this dataset is the result of participants’ work using biological microscopy data to create a model that would be capable of identifying all duplicates.
  2. TensorFlow patch_camelyon Medical Images: Contains 327,000+ colored images in 96×96 pixel format.
  3. Indoor Scenes Images: The commonly touted MIT dataset for projects consists of over 15,000 images containing indoor settings and locations. It’s the perfect model dataset for projects involving scene recognition models. Each image is in JPEG format, and all images have been divided into 67 classes with each class containing 100 images.
  4. Intel Image Classification: Originally created for a contest that Intel hosted, this dataset contains 25,000 images that have been divided into several categories and folders for testing, training, and prediction.
  5. Sun397 Image Classification Dataset: Another Tensorflow dataset containing 108,000+ images that have all been divided into 397 categories.

Text Classification Datasets

  1. Recommender System Datasets: This repository was created and used by UCSD computer science professor Julian McAuley, and includes text data around product reviews, social networks, and question/answer data from multiple sources.
  2. Large Movie Review Dataset: Created by the Stanford AI Laboratory, this dataset consists of 50,000+ reviews of popular/polarizing films that have been divided into 25K for testing and 25K for training. It’s the perfect dataset for anyone looking to build or evaluate sentiment analysis algorithms.
  3. Twitter US Airline Sentiment Dataset: People have a lot to say about the airlines they fly with. This dataset features 15,000 tweets that have been classified as being either positive, negative, or neutral around six different airlines.
  4. Hate Speech and Other Offensive Language Dataset: This dataset was created for researching and identifying hate-speech online. All texts are classified as either hate-speech, offensive language, or neither. Viewer discretion advised.
  5. Stop Clickbait Dataset: The scourge of the internet (at least one of them), this dataset includes 16,000 article headlines from the buzziest and most annoying sites like Buzzfeed, with each headline categories as either being clickbait or not.

Best Places to Find Datasets for Data Science

Should you need more specific datasets for projects, Kaggle & Google Dataset Search have countless entries that should suit your needs.