Datasets serve as the railways upon which machine learning algorithms ride. Without them, any machine-learning algorithm will fail to progress in the domains of text classification, product categorization, and text mining.
We’ve compiled 60 open datasets for machine learning in this list, ranging from highly specific data to Amazon product datasets. Before you begin aggregating this data, it’s important to check a few things. First, be sure the datasets aren’t bloated, as you likely won’t want to spend time sifting through and cleaning up the data yourself. Second, keep in mind that datasets with fewer rows and columns generally take less time to process and are easier to work with.
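One way to act on both checks before committing to a dataset is to profile it first. The sketch below is a minimal illustration, not a standard tool: it assumes the dataset is a CSV with a header row, and the function name profile_csv and the sample data are hypothetical. It reports the row count, the column count, and the share of empty cells, which gives a rough sense of size and of how much cleanup may be needed.

```python
import csv
import io


def profile_csv(text_stream):
    """Return row count, column count, and share of empty cells for CSV text."""
    reader = csv.reader(text_stream)
    header = next(reader, None)
    if header is None:
        # Completely empty input: nothing to profile.
        return {"rows": 0, "columns": 0, "empty_share": 0.0}
    rows = 0
    cells = 0
    empty = 0
    for record in reader:
        rows += 1
        for value in record:
            cells += 1
            if value.strip() == "":
                empty += 1
    empty_share = empty / cells if cells else 0.0
    return {"rows": rows, "columns": len(header), "empty_share": empty_share}


# Hypothetical sample standing in for a downloaded file.
sample = io.StringIO("id,price,label\n1,9.99,toy\n2,,book\n3,4.50,\n")
print(profile_csv(sample))
```

For a real download, the same function can be given an open file handle, for example open(path, newline=""), where path is wherever you saved the file.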
Top Five Open Dataset Finders
When mastering machine learning, practicing with different datasets is a great place to start. Luckily, finding them is easy.
Kaggle: This data science site contains a diverse set of compelling, independently-contributed datasets for machine learning. If you’re looking for niche datasets, Kaggle’s search engine allows you to specify categories to ensure the datasets you find will fit your bill.
UCI Machine Learning Repository: This mainstay of open datasets has been a go-to for decades. Because many of the datasets are user-contributed, it’s important to inspect them for quality, as cleanliness can vary. Most of the datasets are clean, however, which is what makes this repository a go-to. Users can also download the data without needing to register.
Google Dataset Search: Dataset Search contains over 25 million datasets from all across the web. Whether they’re hosted on a publisher’s site, a government domain, or a researcher’s blog, Dataset Search can find them.
AWS Open Data Registry: Of course Amazon has their hands in the open dataset cookie jar as well. The shopping juggernaut brings their trademark resourcefulness to the dataset searching game. One key perk that differentiates AWS Open Data Registry is its user feedback feature, which allows users to add and modify datasets. Experience with AWS is also highly preferred in the job marketplace.
Wikipedia ML Datasets: This Wikipedia page features diverse datasets for machine learning including signal, image, sound, and text, to name a few.
Government Datasets for Machine Learning
If you’re looking for demographic data for your ML algorithms, then look no further than these government data portals. ML models trained via public government data can empower policymakers to recognize and anticipate trends that inform preemptive policy decisions.
Data USA: Data USA offers a fantastic array of powerfully visualized US public data. The information is digestible and readily accessible, making it easy to sift through and decide whether it’s right for you.
EU Open Data Portal: This open data portal offers over a million datasets across 36 European countries published by reputable EU institutions. The site has an easy-to-use interface that allows you to search for specific datasets across a variety of categories including Energy, Sports, Science, and Economics.
Data.gov: This site is fantastic for anyone looking to download a multitude of publicly available data sources from US government agencies. The data is diverse, ranging from budgetary data to school performance scores. The information often requires additional research, which is something to keep in mind.
US Healthcare Data: A rich repository featuring a wide range of US healthcare datasets.
The UK Data Service: This data repository features the UK’s largest collection of social, economic, and population data.
School System Finances: A fabulous repository for anyone interested in education finance data such as revenues, expenditures, debt, and assets of elementary and secondary public school systems. The statistics on this site also cover school systems across the United States, including the District of Columbia.
The US National Center for Education Statistics: This repository contains information on educational institutions and demographics from not just the United States, but also around the world.
Finance & Economics Datasets for Machine Learning
Naturally, the financial sector is embracing machine learning with open arms. Because financial and economic quantitative records are typically kept meticulously, finance and economics are great domains in which to build an AI or ML model. It’s already happening, too: many investment firms use algorithms to guide their stock picks, predictions, and trades. Machine learning is also being used in economics for tasks such as testing economic models and analyzing and predicting the behavior of populations.
American Economic Association (AEA): The AEA is a fantastic source for US macroeconomic data.
Quandl: Another great source for economic and financial data, particularly for building predictive models around stocks and economic indicators.
IMF Data: The International Monetary Fund tracks and meticulously maintains records on foreign exchange reserves, investment outcomes, commodity prices, debt rates, and international finances.
World Bank Open Data: The World Bank’s datasets cover population demographics alongside a high number of economic and development indicators across the world.
Financial Times Market Data: Great for current information around commodities, foreign exchanges, and other worldwide financial markets.
Google Trends: Google Trends gives you the freedom to examine and analyze internet search activity, and also offers glimpses into which stories are trending around the world.
Image Datasets for Computer Vision
Anyone looking to train computer vision applications such as autonomous vehicles, face recognition, and medical imaging technology will need a database of images. This list contains a diverse set of applications that will prove useful.
VisualQA: If you have an understanding of vision and language, this dataset is useful as it contains complex questions pertaining to over 265,000 images.
Labelme: This dataset for machine learning is already annotated, making it primed and ready for any computer vision application.
ImageNet: The go-to machine learning dataset for new algorithms, this dataset is organized according to the WordNet hierarchy, with each node of the hierarchy depicted by a large collection of images.
Indoor Scene Recognition: This highly-specified dataset contains images that are useful to scene recognition models.
Visual Genome: Over 100K highly-detailed and captioned images.
Stanford Dogs Dataset: Great for the dog lovers among us, this dataset contains over 20,000 images of 120 different dog breeds.
Google’s Open Images: Over 9 million URLs to images annotated across 6,000 categories.
COIL-100: Contains 100 objects that are imaged across multiple angles for a full 360 degree view.
CIFAR-10: The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
Cityscapes: Cityscapes contains high-quality pixel-level annotations of 5,000 frames in addition to a larger set of 20,000 coarsely annotated frames.
IMDB-Wiki: This dataset contains over 500K face images gathered from both IMDB and Wikipedia.
Fashion MNIST: This is a dataset of Zalando’s article images. It contains a training set of 60,000 examples and a test set of 10,000 examples.
MS COCO: This dataset contains photos of various objects, and contains over 2 million labelled instances across 300K+ images.
MPII Human Pose Dataset: This dataset includes 25K images containing over 40K people with annotated body joints. It’s perfect for evaluation of articulated human pose estimation.
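When deciding whether an image dataset will fit comfortably in memory, a quick back-of-the-envelope estimate can help. The sketch below is only an illustration: the helper name raw_image_bytes is hypothetical, and it assumes uncompressed pixels stored at one byte per value. It uses the CIFAR-10 figures quoted above and assumes three colour channels (RGB), which the list does not state explicitly.

```python
def raw_image_bytes(num_images, height, width, channels, bytes_per_value=1):
    """Estimate the uncompressed size of an image set in bytes."""
    return num_images * height * width * channels * bytes_per_value


# CIFAR-10 figures from the list above; 3 channels (RGB) is an assumption.
cifar10_bytes = raw_image_bytes(60_000, 32, 32, channels=3)
print(f"{cifar10_bytes / 2**20:.1f} MiB")
```

Under these assumptions the full set comes to 184,320,000 bytes, or roughly 176 MiB. Datasets with larger images or stored on disk in a compressed format will differ considerably from this kind of estimate.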
Sentiment Analysis Datasets for Machine Learning
There are countless ways to improve any sentiment analysis algorithm. These large, highly-specialized datasets can help.
Multi-Domain Sentiment Analysis Dataset: A treasure trove of positive and negative Amazon product reviews (1 to 5 stars) for older products.
Amazon Product Data: Featuring 142.8 million Amazon reviews, this sentiment analysis dataset covers reviews aggregated on Amazon between 1996 and 2014.
Twitter US Airline Sentiment: Twitter data on US airlines dating back to February of 2015 that’s already been classified based on sentiment class (positive, neutral, negative).
IMDB Sentiment: This smaller (and older) dataset is perfect for binary sentiment classification, and features over 25,000 movie reviews.
Sentiment140: One of the more popular datasets, containing 1.6 million tweets that have been vetted for emoticons (which were subsequently removed).
Stanford Sentiment Treebank: Dataset containing over 10,000 Rotten Tomatoes HTML files with sentiment annotations on a scale from 1 (most negative) to 25 (most positive).
Paper Reviews: This dataset is composed of English and Spanish language reviews around computing and informatics. The dataset is evaluated using a five-point scale with -2 being the most negative and 2 being the most positive.
Lexicoder Sentiment Dictionary: This dictionary is designed to be used in accordance with the Lexicoder, which aids in the automated coding of news coverage sentiment, legislative speech, and other text.
Sentiment Lexicons for 81 Languages: This dataset contains positive and negative sentiment lexicons for 81 languages, with the sentiments analyzed and built on English sentiment lexicons.
Opin-Rank Review Dataset: This car dataset features a range of reviews around models manufactured between 2007 and 2009. It also features hotel review data.
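The datasets above use different rating scales (1 to 5 stars, -2 to 2, 1 to 25), so combining them usually means mapping scores onto shared classes first. The following is a minimal sketch of one possible approach, not the method used by any of these datasets: the function name to_sentiment_class and the neutral_band threshold are assumptions chosen for illustration. It rescales a score to the range -1 to 1 and buckets it as negative, neutral, or positive.

```python
def to_sentiment_class(score, low, high, neutral_band=0.2):
    """Map a score on the scale [low, high] to a three-way sentiment class."""
    if high <= low:
        raise ValueError("high must be greater than low")
    if not low <= score <= high:
        raise ValueError("score outside the stated scale")
    # Rescale to [-1, 1] so every rating scale shares one midpoint at 0.
    centered = 2 * (score - low) / (high - low) - 1
    if centered > neutral_band:
        return "positive"
    if centered < -neutral_band:
        return "negative"
    return "neutral"


# The same helper applied to three of the scales mentioned in the list.
print(to_sentiment_class(4, low=1, high=5))    # star rating
print(to_sentiment_class(-2, low=-2, high=2))  # five-point paper review scale
print(to_sentiment_class(13, low=1, high=25))  # 1 to 25 treebank scale
```

The width of the neutral band is a design choice: a wider band sends more borderline scores to the neutral class, which may or may not suit a binary classification task.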
Natural Language Processing Datasets
The following list contains diverse datasets for various NLP tasks, including voice recognition and chatbots.
Enron Dataset: Folder-organized senior management email data from Enron.
UCI’s Spambase: A juicy spam dataset that’s perfect for spam filtering.
Amazon Reviews: Yet another treasure trove containing 35 million Amazon reviews across 18 years featuring product reviews, user information, and even the plaintext view.
Yelp Reviews: 5 million Yelp reviews in an open dataset.
Google Books Ngrams: This library of words is plenty for any NLP algorithm.
SMS Spam Collection in English: Over 5,500 English SMS messages labeled as spam or legitimate.
Jeopardy: Over 200,000 questions from the classic quiz show.
Gutenberg eBooks List: An annotated list of Project Gutenberg’s ebooks.
Blogger Corpus: A bevy of blog posts (600K+), with each blog containing a minimum of 200 occurrences of commonly used English words.
Wikipedia Links Data: With over 1.9 billion words across 4 million articles, this dataset contains the full text of Wikipedia.
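For readers new to the term behind Google Books Ngrams, an n-gram is a run of n consecutive tokens in a text. The sketch below shows how n-gram counts could be computed from any of the text corpora above. It is an illustration under simple assumptions: the function name count_ngrams is hypothetical, and tokens are obtained by lowercasing and splitting on whitespace, which is cruder than the tokenization a production pipeline would use.

```python
from collections import Counter


def count_ngrams(text, n):
    """Count runs of n consecutive whitespace-separated tokens in text."""
    if n < 1:
        raise ValueError("n must be at least 1")
    tokens = text.lower().split()
    # Zip n staggered views of the token list to form the n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample sentence standing in for a real corpus.
bigrams = count_ngrams("the cat sat on the cat", 2)
print(bigrams.most_common(1))
```

Texts shorter than n tokens simply yield an empty counter, so the helper can be applied to a mixed collection of short messages and long documents.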
Datasets for Autonomous Vehicles
Autonomous vehicles require large amounts of top-notch quality datasets to interpret their surroundings and react accordingly.
Berkeley DeepDrive BDD100K: This self-driving AI dataset is considered the largest of its kind. It features over 100,000 videos covering more than 1,100 hours of driving across different times of day, weather, and driving conditions.
Comma.ai: Dataset featuring 7 hours of highway driving that also details the car’s GPS coordinates, speed, acceleration, and steering angles.
Oxford’s Robotic Car: Oxford, UK dataset featuring 100 repetitions of a single route across different times of day and driving conditions (traffic, weather, pedestrians).
LISA: Laboratory for Intelligent & Safe Automobiles, UC San Diego Datasets: Dataset featuring information around traffic signs, vehicle detection, traffic lights, and trajectory patterns.
Cityscapes Dataset: A diverse set of street-scene data across 50 different cities.
Baidu Apolloscapes: This dataset features 26 different semantic items including street lights, pedestrians, buildings, bicycles, cars, and more.
Landmarks: Open-sourced Google dataset designed for distinguishing between natural formations and man-made landmarks. This dataset features over two million images across 30 thousand landmarks around the world.
Landmarks-v2: As image classification technology improves, Google decided to release another dataset to help with landmarks. This even larger dataset features five million images featuring more than 200 thousand landmarks across the world.
PandaSet: PandaSet is working to promote and advance autonomous driving and ML R&D. This dataset features 48,000+ camera images, 16,000+ LiDAR sweeps, 100+ scenes of 8 seconds each, 28 annotation classes, and 37 semantic segmentation labels, and spans the full sensor suite.
nuScenes: This large-scale dataset for autonomous vehicles utilizes the full sensor suite of an actual self-driving car on the road. This vast dataset features 1.4M camera images, 390K LiDAR sweeps, detailed map information, and more.
Open Images V5: This dataset consists of 9M+ images that have been annotated and labeled across thousands of object categories.
Waymo Open Dataset: This open-sourced, high-quality multimodal sensor dataset is extracted from Waymo self-driving vehicles across a diverse set of environments.