Datasets of relevance

Recipe Overview
Reading Time: 15 minutes

Main Objectives

The FAIR cookbook aims to provide hands-on, practical advice on how to deliver FAIR data through interactions with Innovative Medicines Initiative projects. These research projects often, by nature, involve patient-centric information, but dealing with real-world, human-centric data, and clinical data in particular, is challenging. It most often requires interacting with Data Access Committees (DACs) and undergoing a vetting process, which can be lengthy and convoluted. This can become a hindrance when the focus of the work is to deliver training on the computational methods available for such data rather than on data custody tasks, however important these are.

This FAIR cookbook recipe aims to provide a list of relevant clinical data resources so that readers can, with minimal hassle:

  • familiarize themselves with the data types (for instance, what Electronic Health Records look like).

  • familiarize themselves with the procedures to gain access to sensitive data.

  • obtain datasets with which to work and hone computational skills.

The recipe will cover two types of datasets:

  • real datasets such as the MIMIC-III dataset [2], which contains actual medical notes data. Data access requests must be made, but the data are made available to computational scientists for research purposes.

  • synthetic datasets, which are available without restrictions since they are produced by computational methods and are independent of any real patient. While handy, this type of data may come with a number of limitations that prospective users need to be aware of.
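To make the notion of a synthetic dataset concrete, the following minimal sketch generates a few toy patient records using only the Python standard library. The field names, value ranges, and the helper functions `make_synthetic_patients` and `to_csv` are hypothetical and chosen purely for illustration; they do not follow the schema of MIMIC-III, OMOP CDM, or any other resource listed in this recipe, and they do not represent how any particular synthetic data tool works.

```python
import csv
import io
import random

# Hypothetical field names and value ranges, chosen for illustration only;
# they do not follow MIMIC-III, OMOP CDM, or any other real schema.
FIELDS = ["patient_id", "age", "sex", "heart_rate_bpm"]


def make_synthetic_patients(n, seed=0):
    """Return n toy patient records drawn from simple random distributions."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    records = []
    for i in range(n):
        records.append({
            "patient_id": f"SYN-{i:04d}",
            "age": rng.randint(18, 90),
            "sex": rng.choice(["F", "M"]),
            "heart_rate_bpm": round(rng.gauss(75, 12)),
        })
    return records


def to_csv(records):
    """Serialize the records to CSV text, header row first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


patients = make_synthetic_patients(5, seed=42)
print(to_csv(patients))
```

Because every value is drawn independently, no record corresponds to a real patient, which is why data of this kind can be shared freely. The same property illustrates one limitation: the sketch ignores any relationship between variables (for example between age and heart rate), so analyses run on such data may not transfer to real clinical records.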

Electronic Health Records: The MIMIC-III Critical Care Database

Electronic Medical Notes: The EBM NLP

Synthetic Electronic Health Records

One of the main bottlenecks for data miners is the limited availability of electronic health record datasets, due, as we saw, to HIPAA concerns. To bypass these roadblocks, several tools have been developed to generate synthetic datasets that are free of any restrictions. Below, we provide information about one such tool.

Synthetic Electronic Medical Notes: the OMOP CDMv5 Test Data

Clinical Trial Data in CDISC SDTM format:

Observational Data in OMOP CDM format:


This content provides you with a set of resources to kick-start your exploration of unstructured text in a clinical context. These are useful resources for gaining familiarity with these data types. Remember to understand the data stewardship requirements that come with handling real clinical data, as well as the limitations associated with some synthetic datasets.