Data Resources

Data Consortia

Data consortia and related resources for ML training.

ML Training Data Sources

| ID | Last edited | Category | Consortium / Resource | Acronym / Affiliation | URL | Description | Access / Pricing | Status (verified Jun 2026) | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 17/06/2026 07:01 PM | AI / Drug Discovery Partnership | Accelerating Therapeutics for Opportunities in Medicine (ATOM) | ATOM | https://atomscience.org | Public-private partnership using AI and high-performance computing to accelerate drug discovery from target ID to clinical candidate. | Large-institution collaborators; no public pricing; access via industry partnership. | Active | Government-funded AI drug discovery platform (not an open data source). |
| 2 | 17/06/2026 07:01 PM | AI / Healthcare Initiative | AI4Health Consortium | AI4Health | https://www.ai4health.eu | European initiative promoting AI in healthcare and drug discovery; integrates data, modeling, and AI solutions. | Publicly/privately funded; project-based participation, not open membership. | Verify directly | Several EU 'AI4Health' branded efforts exist; confirm the specific entity before relying on it. |
| 3 | 17/06/2026 07:01 PM | Population / Health Dataset | All of Us Research Program | All of Us | https://allofus.nih.gov | U.S. NIH program building one of the largest, most diverse health databases (health, genomic, environmental data) for precision medicine. | Free; registration and approval required for Researcher Workbench data access. | Active | Strong real-world + genomic training data. Researcher Workbench is cloud-based. |
| 4 | 17/06/2026 07:01 PM | Biomarkers | Biomarker Enterprise to Advance Personalized Medicine (BEAM) | BEAM | https://www.personalizedmedicinecoalition.org | Collaboration fostering biomarker discovery, development, validation, and standardization across the life sciences. | Operates under the Personalized Medicine Coalition; membership varies by org type. | Verify directly | Confirm program is still active under PMC; biomarker programs change names frequently. |
| 5 | 17/06/2026 07:01 PM | Regulatory Science / Data Standards | Critical Path Institute | C-Path | https://c-path.org | Non-profit accelerating drug development via data standards and public databases; biomarkers, clinical trial modeling, regulatory science. | Grant- and sponsor-funded consortia; join individual consortia rather than a general membership. | Active | Operates many disease-specific data consortia worth exploring individually. |
| 6 | 17/06/2026 07:01 PM | Bioinformatics Infrastructure | ELIXIR | ELIXIR | https://elixir-europe.org | European intergovernmental org uniting life-science tools, data, and standards for open access; integrates national bioinformatics infrastructure. | Membership for academia and industry; variable fees by service. | Active | Umbrella over many core EU resources (e.g., EBI databases). |
| 7 | 17/06/2026 07:01 PM | Genomics Dataset | ENCODE (Encyclopedia of DNA Elements) | ENCODE | https://www.encodeproject.org | Collaborative project identifying all functional elements in the human genome; large-scale datasets and analysis tools for gene regulation. | Free public access to all data. | Active | Well-curated, ML-friendly functional genomics data. |
| 8 | 17/06/2026 07:01 PM | Biomarkers / Neuro | ERP Biomarkers | ERP Biomarkers | https://erpbiomarkers.org | Identification/validation of electrophysiological (ERP) biomarkers, especially for neuropsychiatric conditions. | Membership for research institutions and pharma; fees by org type. | Verify directly | Niche; confirm site is live and program active before citing. |
| 9 | 17/06/2026 07:01 PM | Biomarkers | FNIH Biomarkers Consortium | Biomarkers Consortium (FNIH) | https://fnih.org/our-programs/biomarkers-consortium/ | Public-private partnership identifying and qualifying biomarkers for drug development and precision medicine across diseases. | Partner-funded; no general membership/pricing. | Active | URL updated to current FNIH program path. |
| 10 | 17/06/2026 07:01 PM | Rare Disease / Patient Registries | Genetic Alliance | Genetic Alliance | https://geneticalliance.org | Consortium linking patient registries to enable research and drug discovery from rare disease patient communities. | Membership for patient orgs and research institutions; variable fees. | Active | Registry data; consent/governance constraints apply for ML use. |
| 11 | 17/06/2026 07:01 PM | Population / Genomics Dataset | Genomics England (100,000 Genomes Project) | Genomics England | https://www.genomicsengland.co.uk | UK initiative sequencing genomes from NHS patients (rare disease, cancer); datasets available for approved research. | Access via approved research applications (Research Environment); no public price list. | Active | 100K project complete; now part of broader NHS genomics + larger newborn/diverse-data programs. |
| 12 | 17/06/2026 07:01 PM | Standards / Data Sharing | Global Alliance for Genomics and Health (GA4GH) | GA4GH | https://www.ga4gh.org | International effort developing frameworks and standards for sharing genomic and health data globally. | Free to participate; some collaborations may involve fees. | Active | Standards body (not a dataset); important for interoperable ML pipelines. |
| 13 | 17/06/2026 07:01 PM | Cloud Data Platform | Google Cloud Public Datasets (BigQuery) | Google | https://cloud.google.com/datasets | Public datasets hosted on Google Cloud, queryable in BigQuery and integrable into tools. | Free to query within BigQuery free tier; compute billed beyond that. | Active | URL updated to current public-datasets hub (old /bigquery/public-data path redirects). |
| 14 | 17/06/2026 07:01 PM | Cloud Data Platform | Google Cloud Life Sciences Public Datasets | Google | https://cloud.google.com/batch | Formerly hosted curated life-science public datasets and the Cloud Life Sciences API for pipelines. | N/A; service retired. | DEPRECATED | Cloud Life Sciences API was deprecated and removed after July 8, 2025; Google directs users to Cloud Batch. Recommend removing or replacing this entry. |
| 15 | 17/06/2026 07:01 PM | Single-Cell / Atlas | Human Cell Atlas (HCA) | HCA | https://www.humancellatlas.org | Global consortium building a reference map of all human cells; cutting-edge single-cell datasets. | Free data access; membership for collaboration. | Active | Data Portal at data.humancellatlas.org. Large single-cell training corpus. |
| 16 | 17/06/2026 07:01 PM | Data Platform / Competitions | Kaggle | Kaggle | https://www.kaggle.com | Platform to explore, analyze, and share datasets; hosts competitions and some synthetic datasets for training. | Free; now part of Google. | Active | Hosted several major life-science ML competitions (e.g., BELKA from Leash Bio, 2024). |
| 17 | 17/06/2026 07:01 PM | AI / Federated Learning | MELLODDY Consortium | MELLODDY | https://www.melloddy.eu | EU IMI project that used federated machine learning across 10 pharma companies' private datasets to improve predictive drug-discovery models. | Was EU Horizon 2020 + industry funded; no public membership. | Concluded (legacy) | Project ended ~2022; site/tooling (MELLODDY-Tuner, MELLODDY-TUDA) now mainly a methods reference. Federated-learning blueprint, not a live data source. |
| 18 | 17/06/2026 07:01 PM | Database Index / Reference | NAR Database Issue & Online Molecular Biology Database Collection | Nucleic Acids Research | https://www.oxfordjournals.org/nar/database/c/ | NAR's annual curated catalog of molecular biology databases by category (1,900+ databases). | Freely available on the NAR website. | Active | Best single index for discovering domain databases; refreshed annually in the NAR Database Issue. |
| 19 | 17/06/2026 07:01 PM | Chemistry / Reactions Dataset | Open Reaction Database | ORD | https://docs.open-reaction-database.org/en/latest/ | Open-access chemical reaction database supporting ML for reaction prediction, synthesis planning, and experiment design. | Open source / open access. | Active | Led by Connor Coley (MIT). |
| 20 | 17/06/2026 07:01 PM | Industry Alliance / Standards | Pistoia Alliance | Pistoia Alliance | https://www.pistoiaalliance.org | Global non-profit lowering barriers to life-science R&D innovation: digital transformation, AI, real-world data, interoperability. | Org membership; fees by level/size. Individual membership historically ~$250. | Active | Offers many free virtual meetings on data topics. Good networking/standards angle. |
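Because the Status column separates live sources from deprecated, concluded, or unverified ones, a training-data shortlist can be produced by filtering on it. The following is a minimal sketch only: the `Resource` class, the `CATALOG` list, and both helper functions are hypothetical names introduced for illustration, and the five entries are copied from rows 7, 2, 14, 17, and 19 of the table rather than loaded from any real export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One catalog row, reduced to the fields needed for filtering (hypothetical schema)."""
    name: str
    category: str
    status: str
    url: str


# A handful of rows copied from the table; the remaining entries would be added the same way.
CATALOG = [
    Resource("ENCODE", "Genomics Dataset", "Active",
             "https://www.encodeproject.org"),
    Resource("AI4Health Consortium", "AI / Healthcare Initiative", "Verify directly",
             "https://www.ai4health.eu"),
    Resource("Google Cloud Life Sciences Public Datasets", "Cloud Data Platform", "DEPRECATED",
             "https://cloud.google.com/batch"),
    Resource("MELLODDY Consortium", "AI / Federated Learning", "Concluded (legacy)",
             "https://www.melloddy.eu"),
    Resource("Open Reaction Database", "Chemistry / Reactions Dataset", "Active",
             "https://docs.open-reaction-database.org/en/latest/"),
]


def by_status(catalog, status="Active"):
    """Return entries whose status matches, ignoring case and surrounding whitespace."""
    wanted = status.strip().lower()
    return [r for r in catalog if r.status.strip().lower() == wanted]


def needs_review(catalog):
    """Return every entry that is not marked Active (deprecated, concluded, or unverified)."""
    return [r for r in catalog if r.status.strip().lower() != "active"]


if __name__ == "__main__":
    for resource in by_status(CATALOG):
        print(f"{resource.name}: {resource.url}")
```

Matching on a normalized status string keeps the filter tolerant of the mixed capitalization in the table ("Active" versus "DEPRECATED"). An "Active" status says nothing about licensing or consent terms, so the Access / Pricing and Comments columns still need to be read for each shortlisted source.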