Biomedical datasets of relevance for training in FAIRification

Recipe Overview

Reading Time	15 minutes
Executable Code	No
Difficulty	Training for FAIRification with open or synthetic biomedical datasets
Recipe Type	Guidance
Audience	Principal Investigator, Data Manager, Terminology Manager, Data Scientist
Maturity Level & Indicator	not applicable

Cite me with FCB069

Main Objectives

This recipe provides example clinical datasets so that users can become familiar with handling clinical data and develop related computational tools, while minimizing the challenges of accessing real-world human data.

The FAIR Cookbook aims to provide hands-on, practical advice on how to deliver FAIR data through interactions with Innovative Medicines Initiative (IMI) projects. These research projects, by nature, often involve patient-centric information. But dealing with real-world data and human-centric information, clinical data in particular, is challenging. It most often requires interacting with Data Access Committees (DACs) and undergoing a vetting process, which can be lengthy and convoluted. This can become a hindrance if the focus of the work is to deliver training on the computational methods available to deal with such data rather than on data custody-related tasks, however important these are.

This recipe aims to provide a list of relevant resources belonging to the realm of clinical data so readers can, with minimal hassle:

familiarize themselves with clinical data types, such as Electronic Health Records (EHR).
familiarize themselves with the procedures to gain access to sensitive data.
obtain datasets with which to work and hone computational skills.

The recipe will cover two types of datasets:

real datasets, such as the Medical Information Mart for Intensive Care III (MIMIC-III) dataset [2], which contains actual medical notes data. Access requests must be made for such data, but they are made available to computational scientists for research purposes.
synthetic datasets, which are available without restrictions since they are produced by computational methods and are independent of any real patient. While handy, this type of data may come with a number of limitations that prospective users need to be aware of.
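
To make the notion of a synthetic dataset more concrete, the following is a minimal sketch, in Python, of how toy patient records could be produced purely by computation. It is not the method used by any of the resources listed in this recipe: the `SyntheticPatient` record, its fields, the value ranges, and the `generate_patients` helper are all hypothetical and chosen only for illustration. It also hints at one typical limitation: because each field is sampled independently, relationships found in real patients are not reproduced.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class SyntheticPatient:
    """Hypothetical toy record; not a real EHR schema."""
    patient_id: str
    age: int
    sex: str
    heart_rate: int  # beats per minute


def generate_patients(n: int, seed: int = 0) -> List[SyntheticPatient]:
    """Draw n toy patient records from hand-picked, illustrative ranges."""
    rng = random.Random(seed)  # seeded so the same dataset can be regenerated
    patients = []
    for i in range(n):
        patients.append(
            SyntheticPatient(
                patient_id=f"SYN-{i:05d}",
                age=rng.randint(18, 90),
                sex=rng.choice(["F", "M"]),
                # Each field is sampled independently, so any realistic
                # correlation (e.g. between age and heart rate) is absent.
                heart_rate=rng.randint(50, 120),
            )
        )
    return patients


if __name__ == "__main__":
    for patient in generate_patients(3):
        print(patient)
```

Since no record derives from a real person, such a file can be shared freely for training purposes, but any analysis run on it says nothing about real clinical populations.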

Electronic Health Records: The MIMIC-III Critical Care Database

Electronic Medical Notes: The EBM NLP

Electronic Medical Notes: SynPUF 1000-person dataset & OMOP-CDM v5 standard

Synthea Electronic Health Records

Clinical Trial Data in CDISC SDTM format

Conclusions

This content provides you with a set of resources to kick-start your exploration of unstructured text in a clinical context. Remember to understand the data stewardship requirements that go along with handling real clinical data, but also the limitations associated with some synthetic datasets.

References

What to read next?

FAIRsharing records appearing in this recipe:

Authors

Name	Affiliation	Contribution
Philippe Rocca-Serra	University of Oxford	Writing - Original Draft
Susanna-Assunta Sansone	University of Oxford	Writing - Review & Editing
Yojana Gaidya	Fraunhofer Institute	Writing - Review & Editing
Fuqi Xu	EMBL-EBI	Reviewing