Table: Biomedical and Clinical Datasets (mostly for NLP)

Send your thoughts via twitter or mail

A sporadically updated table of datasets, in no particular order, that I have come across during my research. I haven’t worked with most of them, so I can’t vouch for their data quality.

Table published from Roam through ivywrite



MedType Dataset

The authors create the WikiMed (650k mentions normalized to UMLS) and PubMedDS datasets and pre-train MedType on them before fine-tuning on medical data.

Dataset from A global network of biomedical relationships derived from text

We have also provided the full set of dependency paths connecting biomedical entities in Medline abstracts, with associated sentences

one of my favorite approaches to relation extraction

Therapeutic Drug Database


a large-scale cloze-style biomedical MRC dataset


Hetionet is an integrative network of biomedical knowledge assembled from 29 different databases of genes, compounds, diseases, and more. The network combines over 50 years of biomedical information into a single resource, consisting of 47,031 nodes (11 types) and 2,250,197 relationships (24 types).

Figure 1. Visualization of the heterogeneous biomedical network Hetionet.

Sounds good, but I don't know anybody actually using this.
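To make the "typed nodes, typed relationships" idea concrete, here is a minimal sketch in Python. It is an assumption-laden toy, not Hetionet's actual file format or API: the node names are placeholders, and the helpers `summarize` and `neighbors` are made up for illustration. The type labels only mimic the kind of node and relationship types described above.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Toy data only: placeholder nodes, each with exactly one node type.
NODES: Dict[str, str] = {
    "compound_a": "Compound",
    "compound_b": "Compound",
    "disease_x": "Disease",
    "gene_1": "Gene",
}

# Typed edges as (source, relationship type, target). All invented.
EDGES: List[Tuple[str, str, str]] = [
    ("compound_a", "treats", "disease_x"),
    ("compound_a", "binds", "gene_1"),
    ("compound_b", "binds", "gene_1"),
    ("disease_x", "associates", "gene_1"),
]


def summarize(nodes: Dict[str, str], edges: List[Tuple[str, str, str]]):
    """Count nodes per node type and edges per relationship type."""
    node_counts = Counter(nodes.values())
    edge_counts = Counter(rel for _, rel, _ in edges)
    return node_counts, edge_counts


def neighbors(edges: List[Tuple[str, str, str]], source: str, rel: str) -> List[str]:
    """Targets reachable from `source` over edges of type `rel`."""
    return [t for s, r, t in edges if s == source and r == rel]


node_counts, edge_counts = summarize(NODES, EDGES)
print(dict(node_counts))
print(dict(edge_counts))
print(neighbors(EDGES, "compound_a", "binds"))
```

The same two counters, run over the real network, would correspond to the node-type and relationship-type totals quoted in the description.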

Detecting Scientific Claim

We release the dataset of annotated 1,500 abstracts containing 11,702 sentences (2,276 annotated as claim sentences) sampled from 110 biomedical journals.


Dependencies, genetic, cancer etc.

INDRA uses some of it for DepMap Explainer


Annotated, ~1,500 sentences

Synonyms for [[NER]] pretrained on NCBI from paper: BioSyn

Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BIOSYN consistently outperforms previous state-of-the-art models almost reaching the upper bound on each dataset.

Seems popular given its very short lifespan

Causal TimeBank. CausalTB


Grounding Protein Relations (isa, part-of)

FEVER: a large-scale dataset for Fact Extraction and VERification

PubMed Downloads


DocRED: A Large-Scale Document-Level Relation Extraction Dataset


dataset and pipeline for [[entailment recognition]]

BioPortal as a dataset of linked biomedical ontologies and terminologies in RDF

Mimic Code

MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital).

Causal BioNet CBN

National Library of Medicine

AllenAI datasets and demos


gold-standard [[database]] mapping the effect of genetic mutations on treatments


CancerMine, a text-mined and routinely updated database of drivers, oncogenes and tumor suppressors in different types of cancer. All data are available online

PubTator Central

[[Entity Normalization]] and recognition work well. Downloadable in full via FTP! ...30GB yumyum!

S2ORC: Semantic Scholar Open Research Corpus

S2ORC is a large contextual citation graph of English-language academic papers from multiple scientific domains; the corpus consists of 81.1M papers, 380.5M citation edges, and associated paper metadata. We provide structured full text for 8.1M open access papers.

They used Grobid for PDF parsing. They also parsed the LaTeX!

License tricky for commercial products


Never understood what this was supposed to be

And as always when the EU tries to coordinate: total failure

BioNLP shared tasks test sets

NCBI BLUE Benchmark

Biomedical Language Understanding Evaluation benchmark

BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora.

NaCTeM corpora 💫

links to corpora of various sizes, with different levels of annotation, and belonging to different domains.

BioNLP data NCBI

collection resources

PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts (https://github.com/Franck-Dernoncourt/pubmed-rct)

The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with their role in the abstract using one of the following classes: background, objective, method, result, or conclusion.
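A minimal sketch of reading such labeled abstracts in Python. It assumes a plain-text layout that should be verified against the actual files: a `###<id>` line opens each abstract, followed by one `LABEL<TAB>sentence` line per sentence, with blank lines between abstracts. The function name `parse_rct` and the sample text are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def parse_rct(text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Parse abstracts into {abstract_id: [(label, sentence), ...]}.

    Assumed layout (check the real files): a '###<id>' line starts an
    abstract, then one 'LABEL<TAB>sentence' line per sentence.
    """
    abstracts: Dict[str, List[Tuple[str, str]]] = {}
    current_id: Optional[str] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines only separate abstracts
        if line.startswith("###"):
            current_id = line[3:]
            abstracts[current_id] = []
        elif current_id is not None and "\t" in line:
            label, sentence = line.split("\t", 1)
            abstracts[current_id].append((label.lower(), sentence))
    return abstracts


# Made-up sample in the assumed layout, not real data from the corpus.
SAMPLE = (
    "###0001\n"
    "BACKGROUND\tDrug X is widely used.\n"
    "RESULTS\tPain decreased in the treatment arm.\n"
    "\n"
    "###0002\n"
    "OBJECTIVE\tTo compare A and B.\n"
)

parsed = parse_rct(SAMPLE)
for abstract_id, sentences in parsed.items():
    print(abstract_id, [label for label, _ in sentences])
```

Keeping the sentences grouped per abstract, in order, matters here because the task is sequential: a sentence's role depends on where it sits in the abstract.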

University of Darmstadt | Linguistic Datasets


Covid Open Data Initiatives

lots of good training data and compound library

BioRel Corpus

BioRelEx: 2,000 annotated sentences

We present BioRel, a large-scale dataset constructed by using Unified Medical Language System (UMLS) as knowledge base and Medline as corpus. Entities in sentences of Medline are identified and linked to UMLS by Metamap. Relation label for each sentence is recognized using distant supervision.
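The distant supervision step described above can be sketched roughly as follows. This is a toy illustration, not the BioRel authors' pipeline: the knowledge base is a hand-written dict standing in for UMLS, the concept IDs and relation names are invented placeholders, and entity linking (MetaMap in the paper) is assumed to have happened already.

```python
from typing import Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    sentence: str
    head_id: str  # hypothetical concept ID of the first linked entity
    tail_id: str  # hypothetical concept ID of the second linked entity


# Toy stand-in for a knowledge base: (head, tail) -> relation name.
KB: Dict[Tuple[str, str], str] = {
    ("C_DRUG_A", "C_DISEASE_X"): "may_treat",
    ("C_GENE_1", "C_DISEASE_Y"): "associated_with",
}


def distant_label(
    mentions: List[Mention], kb: Dict[Tuple[str, str], str]
) -> List[Tuple[str, str]]:
    """Give each sentence the KB relation of its entity pair, else 'NA'."""
    return [(m.sentence, kb.get((m.head_id, m.tail_id), "NA")) for m in mentions]


mentions = [
    Mention("Drug A relieved symptoms of disease X.", "C_DRUG_A", "C_DISEASE_X"),
    Mention("Drug A was not tested in disease X.", "C_DRUG_A", "C_DISEASE_X"),
    Mention("Gene 1 and drug A were both measured.", "C_GENE_1", "C_DRUG_A"),
]

labeled = distant_label(mentions, KB)
for sentence, label in labeled:
    print(f"{label:>10}  {sentence}")
```

Note the second sentence: it gets `may_treat` only because the pair is in the knowledge base, not because the sentence says so. That is the usual source of label noise in distantly supervised corpora.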

AIMed Corpus

The AIMed corpus consists of 225 Medline abstracts. 200 abstracts describe interactions between human proteins, 25 do not refer to any interaction. There are 4084 protein references and around 1000 tagged interactions in this data set. In this data set there is no distinction between genes and proteins and the relations are symmetric.


This Before That

Another Dataset for causal tagging

🎉 List of biomedical datasets and tools

Great Collection

Genia Corpus

well known corpus

categories in Genia database


Very structured but unfortunately no set for biomedicine

PGxCorpus Browsable Online

manually annotated.

Relationship Types


“What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines”

Semantic Relationship Datasets

A collection of public and free annotated datasets of relationships between entities/nominals

BioIE Curated List of Resources 💫 🚀

List of resources and datasets!

Exciting Developments


NDEx is a network sharing and versioning website with a programmatic API for accessing networks

Reactome Pathway Browser

Reactome is a free, open-source, curated and peer-reviewed pathway [[database]].

🙌 Drug Knowledge Databases

UK Biobank

in-depth genetic and health information from half a million UK participants

“[[range restricted]] genetic data (taken after biopsy (dead))” - anonymous friend

Reach Datasets

Collection of datasets


The Semantic MEDLINE Database (SemMedDB) [1] is a repository of semantic predications (subject-predicate-object triples) extracted by SemRep, a semantic interpreter of biomedical text [2]. SemMedDB currently contains information about approximately 94.0 million predications from all of PubMed citations (about 27.9 million citations, as of December 31 2017) and forms the backbone of the Semantic MEDLINE application.
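For a feel of what working with subject-predicate-object triples looks like, here is a small Python sketch. It does not follow SemMedDB's actual schema: the `Predication` fields, the helper names, and all rows are invented for illustration, and the predicate labels are just placeholders in an uppercase style.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Set, Tuple


class Predication(NamedTuple):
    subject: str
    predicate: str
    obj: str
    citation_id: str  # hypothetical field: ID of the source citation


# Invented rows for illustration; not taken from SemMedDB.
PREDICATIONS: List[Predication] = [
    Predication("drug_a", "TREATS", "disease_x", "cit_1"),
    Predication("drug_a", "TREATS", "disease_x", "cit_2"),
    Predication("drug_b", "INHIBITS", "gene_1", "cit_3"),
    Predication("gene_1", "CAUSES", "disease_x", "cit_3"),
]


def support_counts(preds: List[Predication]) -> Dict[Tuple[str, str, str], int]:
    """Map each distinct triple to the number of distinct citations behind it."""
    citations: Dict[Tuple[str, str, str], Set[str]] = defaultdict(set)
    for p in preds:
        citations[(p.subject, p.predicate, p.obj)].add(p.citation_id)
    return {triple: len(ids) for triple, ids in citations.items()}


def objects_of(preds: List[Predication], subject: str, predicate: str) -> List[str]:
    """Distinct objects for a subject and predicate, in first-seen order."""
    seen: List[str] = []
    for p in preds:
        if p.subject == subject and p.predicate == predicate and p.obj not in seen:
            seen.append(p.obj)
    return seen


counts = support_counts(PREDICATIONS)
print(counts[("drug_a", "TREATS", "disease_x")])
print(objects_of(PREDICATIONS, "drug_a", "TREATS"))
```

Counting distinct citations per triple is a common first filter for text-mined predications, since a triple extracted from a single sentence is more likely to be an extraction error.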

Google Dataset Search


Often interesting finds, but very low recall; almost none of the interesting datasets for bioNLP show up


210M open access research papers


A screening library of 12,000 molecules assembled by combining three databases (Clarivate Integrity, GVK Excelra GoStar and Citeline Pharmaprojects) to facilitate drug repurposing