Biomedical datasets of relevance for training in FAIRification

Recipe Overview
Reading Time
15 minutes
Executable Code
Training for FAIRification with open or synthetic biomedical datasets
Recipe Type
Maturity Level & Indicator
not applicable

Main Objectives

This recipe aims to provide example clinical datasets to allow users to get familiar with the process of handling clinical datasets and develop related computational tools while minimizing the challenges of accessing real-world human data.

The FAIR Cookbook aims to provide hands-on, practical advice on how to deliver FAIR data through interactions with Innovative Medicines Initiative (IMI) projects. These research projects, by nature, often involve patient-centric information. However, dealing with real-world, human-centric data, and clinical data in particular, is challenging. Access most often requires interacting with Data Access Committees (DACs) and undergoing a vetting process, which can be lengthy and convoluted. This can become a hindrance when the focus of the work is to deliver training on the computational methods available to deal with such data rather than on data custody-related tasks, however important these are.

This recipe provides a list of relevant clinical data resources so that readers can, with minimal hassle:

  • familiarize themselves with clinical data types, such as Electronic Health Records (EHR).

  • familiarize themselves with the procedures to gain access to sensitive data.

  • obtain datasets with which to work and hone computational skills.

The recipe will cover two types of datasets:

  • real datasets, such as the Medical Information Mart for Intensive Care III (MIMIC-III) dataset [2], which corresponds to actual medical notes data. Data access requests must be made, but the data are made available to computational scientists for research purposes.

  • synthetic datasets, which are available without restrictions because they are produced by computational methods and are independent of any real patient. While handy, this type of data comes with a number of limitations that prospective users need to be aware of.
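
To make the idea of a synthetic dataset concrete, the following is a minimal sketch of how fake patient records could be produced by computational methods alone. It is not the method used by any of the resources listed in this recipe. The column names, value ranges, and the `make_synthetic_patients` and `to_csv` helpers are assumptions made purely for illustration.

```python
import csv
import io
import random
from datetime import date, timedelta

# Hypothetical column names, chosen only for illustration.
FIELDS = ["patient_id", "birth_date", "sex", "heart_rate_bpm", "diagnosis_code"]
# A few ICD-10 codes used as example values (hypertension, type 2 diabetes, asthma).
DIAGNOSIS_CODES = ["I10", "E11", "J45"]


def make_synthetic_patients(n, seed=0):
    """Return n fake patient records drawn from simple random distributions."""
    rng = random.Random(seed)  # fixed seed makes the dataset reproducible
    records = []
    for i in range(n):
        # Assumed birth date range: roughly 60 years starting in 1940.
        birth = date(1940, 1, 1) + timedelta(days=rng.randrange(0, 365 * 60))
        records.append({
            "patient_id": f"SYN-{i:05d}",
            "birth_date": birth.isoformat(),
            "sex": rng.choice(["F", "M"]),
            "heart_rate_bpm": round(rng.gauss(75, 12)),
            "diagnosis_code": rng.choice(DIAGNOSIS_CODES),
        })
    return records


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


patients = make_synthetic_patients(5)
print(to_csv(patients))
```

Because every field in this sketch is sampled independently, the output carries none of the correlations found in real clinical data (for example, between age and diagnosis). This is one example of the limitations mentioned above.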

Electronic Health Records: The MIMIC-III Critical Care Database

Electronic Medical Notes: The EBM NLP

Electronic Medical Notes: SynPUF 1000 person dataset & OMOP-CDM v5 standard

Synthea Electronic Health Records

Clinical Trial Data in CDISC SDTM format:


This content provides you with a set of resources to kick-start your exploration of unstructured text in a clinical context. Remember to understand the data stewardship requirements that come with handling real clinical data, as well as the limitations associated with some synthetic datasets.