
26 AI/ML Datasets & Search Engines You Can Use Right Now

September 20, 2021

Looking for some AI/ML datasets? You’ve come to the right place. In an effort to facilitate your search for ground truth data and open ML datasets for your machine learning project, we’ve hand-picked a number of dataset search engines, aggregators, and repositories, categorized as follows:

  1. Dataset aggregators and search engines: One-stop shops for finding applicable datasets
  2. Industry and application-specific datasets: Curated lists of datasets for specific use cases
  3. Government bodies: Datasets provided by national authorities
  4. International organizations: Datasets whose metrics span nations and geographies
  5. Education: Leading educational institutions that provide ML training resources

ML dataset aggregators and search engines

These search platforms index thousands of datasets from leading institutions and let you find them with simple queries. When researching the datasets available for your project, make these your go-to platforms: they are the most likely to point you to the right dataset for your needs.

  1. Kaggle: An all-in-one platform for machine learning enthusiasts, Kaggle boasts an impressive portfolio of community-driven datasets that can be filtered by application, tags, and popularity. Besides its extensive datasets, Kaggle also offers a wide variety of competitions, courses, sample code, and forums.
  2. Google Dataset Search: Provided by Google, Dataset Search surfaces datasets published by reputable organizations such as the World Health Organization, Statista, and Harvard. It indexes dataset descriptions and links back to the publisher, so the data itself stays wherever the publisher hosts it. Google also hosts a separate collection of public datasets on Google Cloud Platform that can be examined with the BigQuery tool; working with those requires a Google Cloud Platform account.
  3. Registry of Open Data on AWS: AWS enables users to analyze data shared on AWS and build services on top of it using a wide range of AWS compute and data analytics products, including Amazon EC2, Amazon Athena, AWS Lambda, and Amazon EMR. Registered users can access and download data for free, but the compute and analytics tools are charged at AWS’s standard rates.
  4. Data World: A cloud-hosted data catalog with entries contributed by thousands of users and organizations across the world. A data catalog is a metadata management tool that organizations use to inventory and organize the data within their systems.
  5. DataCite: Through re3data, which merged with Databib, DataCite serves the research community with a single, sustainable registry of research data repositories. DataCite is an international not-for-profit organization that aims to improve data citation in order to make research data easier to access on the Internet and to increase acceptance of research data as legitimate.
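
To make the idea of a data catalog concrete, here is a minimal sketch in Python. It is not how any of the platforms above is implemented: the `DatasetEntry` and `DataCatalog` names, the fields, and the sample entries are all hypothetical, chosen only to show a catalog as an inventory of metadata that can be searched by keyword and tag and ranked by popularity.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class DatasetEntry:
    """Metadata describing one dataset; the data itself lives elsewhere."""
    name: str
    publisher: str
    tags: Set[str] = field(default_factory=set)
    downloads: int = 0  # hypothetical stand-in for a popularity metric


class DataCatalog:
    """Hypothetical in-memory catalog: an inventory of dataset metadata."""

    def __init__(self) -> None:
        self._entries: List[DatasetEntry] = []

    def add(self, entry: DatasetEntry) -> None:
        self._entries.append(entry)

    def search(self, query: str = "", tag: Optional[str] = None) -> List[DatasetEntry]:
        """Return entries whose name contains `query` and, if `tag` is given,
        that carry that tag. Results are ordered most popular first."""
        query = query.lower()
        matches = [
            e for e in self._entries
            if query in e.name.lower() and (tag is None or tag in e.tags)
        ]
        return sorted(matches, key=lambda e: e.downloads, reverse=True)


# Made-up sample entries, for illustration only.
catalog = DataCatalog()
catalog.add(DatasetEntry("City crime reports", "Example City", {"crime", "tabular"}, 1200))
catalog.add(DatasetEntry("Stock weekly returns", "Example Exchange", {"finance", "tabular"}, 5400))
catalog.add(DatasetEntry("Face keypoints", "Example Lab", {"vision", "faces"}, 800))

tabular = catalog.search(tag="tabular")
print([e.name for e in tabular])
```

Real aggregators add far more on top of this (full-text search, licensing information, versioning), but the core pattern is the same: you search the metadata first, then follow it to the data.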

Industry and application-specific ML datasets

iMerit has curated a list of leading free repositories for industry-specific datasets. Each of the sources below contains a list of repositories where you can explore potential datasets.

  1. Facial Recognition: This list of multiple databases contains faces, annotated video frames of facial keypoints, fake faces paired with real ones, and more. This is a great resource for any face-oriented computer vision applications such as personal device security, criminal justice, and even augmented reality.
  2. Computer Vision: Taking a step back from facial recognition, this list contains multiple computer vision datasets made up of a diverse range of images, including video sequences, multiple camera angles around the same subject, and even multi-dimensional medical scanner data.
  3. Finance and Economy: This is a comprehensive list showcasing a variety of datasets from high-level macroeconomic data to stock market weekly returns.
  4. Stock Market: Doubling down on the financial side of open datasets, these are our ten picks around stock market and cryptocurrency datasets for building machine learning models.
  5. Crime: These datasets leverage publicly available information, mainly across the US, to help you analyze crime rates, or assess crime trends for specific areas.
  6. Chatbot: This is a list of the most successful and commonly-used datasets for training a chatbot. Each of the entries on this list contains relevant data including customer support data, multilingual data, dialogue data, and question-answer data.
  7. Natural Language Processing: Drawing on our own expertise, iMerit has compiled this list of the top NLP datasets we have used to help clients optimize their NLP projects. The sources on this list cover projects ranging from sentiment analysis to audio and voice recognition.
  8. Life Science, Medical and Healthcare: These datasets are a great choice for training algorithms that identify the early onset of diseases and recommend treatment.

Government bodies ML datasets

While most countries publish open data covering the economy, environment, infrastructure, and health, we have compiled a list of English-first government data portals.

  1. Australian Government Dataset: Data is provided by local councils, as well as larger entities such as Geoscience Australia, the Bureau of Mineral Resources, and CSIRO Oceans & Atmosphere.
  2. New Zealand’s Government Dataset: Over 31,000 datasets are available from organizations such as Land Information, Ministry of Environment, Ministry for Primary Industries, and more.
  3. Singapore Government Dataset: Datasets covering areas of the city-state such as Economy, Education, Environment, Finance, Health, Infrastructure, Society, Technology, and Transport.
  4. US Gov Data: With an impressive 320,000 datasets, the US Gov Data website lets you filter datasets by publisher type, including federal government, city government, state government, universities, and county governments.
  5. Canada Government Dataset: The Canadian dataset portal contains information on areas such as Agriculture, Economics and Industry, Education and Training, Government and Politics, Information and Communications, and more.
  6. EU Open Data Portal: A catalogue of datasets from all European countries, including members of the EU, EU institutions, as well as non-members such as the UK and Switzerland.

International organizations ML datasets

Unlike the government-provided datasets, these international organizations publish data spanning multiple countries, with the focus depending on each organization’s mission.

  1. UNICEF: The United Nations Children’s Fund provides datasets on topics such as Climate change, Gender equality, Malnutrition, and Sustainable Development Goals.
  2. WHO: The World Health Organization has datasets on many health-related topics, ranging from high-level resources such as the Mental Health Atlas country profiles to detailed sources such as the South-East Asia Regional Microdata Repository.
  3. IMF: The International Monetary Fund publishes data on international finances, debt rates, foreign exchange reserves, commodity prices, and investments.
  4. UNESCO: The UNESCO Institute for Statistics (UIS) is the official source of internationally comparable data on education, science, culture, and communication. The UIS produces a wide range of state-of-the-art databases to fuel the policies and investments needed to transform lives and propel the world towards its development goals. The UIS provides free access to data for all UNESCO countries and regional groupings from 1970 to the most recent year available.
  5. WTO: The World Trade Organization publishes reports on anything trade-related, such as merchandise trade and trade in services statistics, market access indicators, non-tariff information, and other indicators.

Education ML datasets

These universities have played a prominent role in the development of machine learning. They provide research-oriented datasets that anyone can use.

  1. UCI Machine Learning Repository: The University of California Irvine offers 507 datasets that cover bank marketing, car evaluation, lung cancer diagnosis, and many other subjects.
  2. CMU Libraries: Carnegie Mellon University has its own collection of public datasets that you can use for your own research. There you will find insightful databases about American culture, music, and history that other aggregators don’t provide.

Labeling datasets quickly and cost-efficiently

Even though endless amounts of data are available for free from the above resources, a supervised learning algorithm cannot make sense of that data without adequate labeling. To help you address the challenge of accurately labeling your datasets, iMerit provides a wide range of labeling services for areas such as computer vision and NLP, and leverages a highly specialized workforce with industry-specific knowledge. If data has you down, then let iMerit solve your data challenges today.