Open datasets are the lifeblood of any machine learning model. Machine learning’s applications within the life science, healthcare, and medical fields are already proving incredibly fruitful in predicting diseases and analyzing their transmission. Machine learning is also suggesting ways we can adequately care for the sick and aging members of a given population.
In this piece, we at iMerit have compiled life science, medical, and healthcare datasets for your machine learning needs.
General Life Sciences, Healthcare and Medical Datasets
Life Science Database Archive: This life science dataset was generated in Japan over a long timespan. The archive has a unified metadata format, which makes it easier for users to search for, access, and download datasets.
HealthData.gov: This official United States government website features many datasets meant to aid in tracking and improving the health of the American population. It features datasets ranging from cholesterol tracking all the way to COVID-19 data.
Medicare Provider Utilization and Payment Data: This medical dataset serves primarily as an information source on the services and procedures that physicians provided to Medicare beneficiaries. The data was collected between 2012 and 2018.
Chronic Disease Data: This is one of the CDC’s medical datasets for chronic disease indicators. The set of 124 indicators, selected by medical consensus as valid, is collected across a variety of states, cities, and territories.
Human Mortality Database: The Human Mortality Database features mortality, population, and other health/demographic data across 35 countries.
MHealth (Mobile Health) Dataset: This dataset is unique among medical datasets as it tracks just ten users who wore sensors placed over their chests, right wrists, and left ankles while they performed a variety of physical activities, making it a potent body motion and vital signs dataset.
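Body-motion recordings like these are often cut into fixed-size windows before being fed to a model. The following is a minimal sketch of that idea, not the MHealth dataset's own tooling: the function name `make_windows`, the window size, the step, and the sample values are all hypothetical, and the data is synthetic rather than taken from the dataset.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def make_windows(
    samples: Sequence[Sequence[float]],
    labels: Sequence[str],
    window_size: int,
    step: int,
) -> List[Tuple[List[Sequence[float]], str]]:
    """Split a stream of sensor samples into overlapping windows.

    Each window is paired with the activity label that occurs most
    often among the samples it contains.
    """
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")

    windows = []
    for start in range(0, len(samples) - window_size + 1, step):
        end = start + window_size
        chunk = list(samples[start:end])
        majority_label = Counter(labels[start:end]).most_common(1)[0][0]
        windows.append((chunk, majority_label))
    return windows


# Synthetic stand-in data: ten readings with three channels each
# (for example, x/y/z acceleration from a single sensor).
samples = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(10)]
labels = ["walking"] * 5 + ["standing"] * 5

windows = make_windows(samples, labels, window_size=4, step=2)
for chunk, label in windows:
    print(len(chunk), label)
```

Overlapping windows (a step smaller than the window size) give a model more training examples from a small group of wearers, at the cost of correlated samples.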
Image Datasets for Life Sciences, Healthcare and Medicine
OASIS: This dataset hails from the Open Access Series of Imaging Studies (OASIS) and aims to provide neuroimaging datasets to the scientific community at no charge. It covers 1,098 subjects across 2,168 MR sessions and 1,608 PET sessions.
OpenNeuro: OpenNeuro is a free and open platform for sharing MRI, MEG, EEG, iEEG, ECoG, ASL, and PET data. It contains 563 medical datasets that cover 19,187 participants.
ADNI: The Alzheimer’s Disease Neuroimaging Initiative (ADNI) features data collected by researchers around the world who are working to define the progression of Alzheimer’s disease. The data includes MRI and PET images, genetics, cognitive tests, and CSF and blood biomarkers.
Genome Datasets
Genome in a Bottle: This dataset includes reference genomes intended to enable the translation of whole human genome sequencing to clinical practice.
GEO Datasets: GEO DataSets stores curated gene expression DataSets, along with original Series and Platform records, in the Gene Expression Omnibus (GEO) repository.
1000 Genomes Project: The international collaboration that makes up the 1000 Genomes Project has produced what is considered to be one of the most detailed catalogues of human genetic variation. It covers SNPs, structural variants, and their haplotype context.
Hospital Datasets
Healthcare Cost and Utilization Project (HCUP): This nationwide database was designed for identifying, tracking, and analyzing national trends in healthcare utilization, access, charges, quality, and outcomes. Each of the medical datasets contains encounter-level information on all patient stays, along with emergency department visits and ambulatory surgery, in US hospitals.
MIMIC Critical Care Database: This openly available medical dataset was developed by the MIT Lab for Computational Physiology. The MIMIC dataset comprises de-identified health data relating to over 40,000 critical care patients.
Medicare Hospital Quality: These are the official medical datasets of Medicare.gov, originally provided by the Centers for Medicare & Medicaid Services. They empower users to compare and evaluate the results and quality of care from over 4,000 Medicare-certified hospitals.
Cancer Datasets
CT Medical Images: This dataset of cancer-patient CT scans was designed to enable alternative methods for examining trends in CT image data associated with contrast, modality, and patient age.
Broad Institute Cancer Program Datasets: This collection gathers datasets from the Broad Institute’s cancer program. Classifications include tumor types, gene expression patterns, multi-class molecular cancer classification, and more.
SEER Cancer Incidence: This cancer data, provided by the US government, is segmented by basic demographic distinctions such as race, gender, and age.
International Collaboration on Cancer Reporting (ICCR): The medical datasets within the ICCR have been developed with the end goal of providing an evidence-based approach to all cancer reporting.
Lung Cancer Data Set: This free dataset features information relating to lung cancer going all the way back to 1995.