11.5.3. Bioactivity data profile

Recipe Overview
Reading Time
30 minutes
Executable Code
Outlining a metadata profile for Bioactivity data
Recipe Type
Maturity Level & Indicator
Main objective

This recipe shows how to prepare bioactivity data, defined as the measurable effects of a chemical compound in a biological system monitored with a specific assay, to meet the ChEMBL submission criteria, focusing on data formats, structures, and vocabularies. The recipe addresses the Findability and Interoperability of this type of data.

Graphical overview of the Recipe

FAIRification Objectives

Introduction

Bioactivity data, as stored in public archives such as the European repository ChEMBL or its US counterpart PubChem together with chemical data and omics data, can be used to search for new hits (compounds with a desired property in drug screening), for example by using cell line information or compound IDs as input to queries over such resources.

An early-stage bioactivity dataset includes the compound molecular structure, molecular production details, assay data, and pharmacokinetic study information.

By enhancing the reporting of bioactivity data, the FAIR principles for data management can guide the improvement of the pharmacokinetic properties of compounds and the identification of drug targets.

Among the FAIR principles, the use of rich metadata (F2. data are described with rich metadata, and R1. (meta)data are richly described with a plurality of accurate and relevant attributes) and the reliance on community standards (R1.3. (meta)data meet domain-relevant community standards) are essential.

In the context of bioactivity data, we have on the one hand the Minimum information about a bioactive entity (MIABE) checklist, which recommends attributes, formats and vocabularies for the reuse of such datasets.

On the other hand, public bioactivity data archives, such as ChEMBL, PubChem, and ECBD, also have their own requirements for data submission.

Data content

Content | Details | Data types
Chemistry (SDF) | Structure ID | SDF, InChI, CID
Target | Protein/GENE ID | PN_ or SwissProt ID
Assay | Typology | Binding, FRET, SPR, Inhibition, phenotypic cellular
Result | Type Potency/Tox | CC50/IC50/EC50/%
Unit | Result unit | Concentration/ratio/SI

  • Matrix Format-Zarr

Minimum metadata

A minimum metadata set represents a collection of metadata items that should ideally be systematically supplied to support interpretation by humans or machines within a specific domain, for instance bioactivity experimental data. The minimum metadata set includes three parts:

  1. Assay and project bibliographic references (mainly links to literature and protocol or summary)

    • Project level metadata

    • Common sample-level metadata, such as species, tissue, cell type and so on.

  2. Chemical compounds reference, including chemical structures

  3. Assay results
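The three parts of the minimum metadata set can be sketched as a small data structure with a completeness check. This is only an illustration: the class names, field names and the `missing_items` helper are hypothetical and are not part of the ChEMBL or MIABE specifications, and the example values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AssayReference:
    """Part 1: assay and project bibliographic references."""
    project_title: str               # project-level metadata
    protocol_link: str               # link to literature, protocol or summary
    species: Optional[str] = None    # common sample-level metadata
    tissue: Optional[str] = None
    cell_type: Optional[str] = None


@dataclass
class CompoundReference:
    """Part 2: chemical compound reference, including its structure."""
    compound_id: str
    structure: str                   # e.g. an InChI string or an SDF block


@dataclass
class AssayResult:
    """Part 3: a single assay result."""
    compound_id: str
    result_type: str                 # e.g. "IC50", "EC50", "CC50", "%"
    value: float
    unit: str


@dataclass
class BioactivityRecord:
    assay: AssayReference
    compounds: List[CompoundReference] = field(default_factory=list)
    results: List[AssayResult] = field(default_factory=list)


def missing_items(record: BioactivityRecord) -> List[str]:
    """List the minimum metadata items that are absent from a record."""
    problems = []
    if not record.assay.project_title:
        problems.append("project title")
    if not record.assay.protocol_link:
        problems.append("protocol or literature link")
    if not record.compounds:
        problems.append("compound references")
    if not record.results:
        problems.append("assay results")
    # Every result should point to a compound whose structure is supplied.
    known = {c.compound_id for c in record.compounds}
    for result in record.results:
        if result.compound_id not in known:
            problems.append(f"structure for compound {result.compound_id}")
    return problems


# Placeholder values for illustration only (the InChI is that of methane).
record = BioactivityRecord(
    assay=AssayReference("demo project", "https://example.org/protocol",
                         species="Homo sapiens", cell_type="Caco-2"),
    compounds=[CompoundReference("CPD-1", "InChI=1S/CH4/h1H4")],
    results=[AssayResult("CPD-1", "IC50", 1.2, "uM")],
)
print(missing_items(record))
```

Keeping the check separate from the data classes makes it easy to extend with the additional desirable metadata, for example for mutated cell lines or target proteins.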

For ChEMBL submission, molecular structures and assay description as depicted in the scheme above are suggested as essential metadata. This is a subset of the following schema. In case mutated cell lines and/or mutated target proteins have been used in the assay, additional desirable metadata should be added in the proper group. MIABE also lists detailed bioassay description requirements.

Besides metadata, the diagram below also shows how to prepare numeric assay data.

Data vocabularies

A set of well-established standards and minimum metadata checklists exists for various aspects of ChEMBL formatting.

  • Chemical information ontology (CHEMINF) http://semanticchemistry.github.io/semanticchemistry/ontology/cheminf.owl

    CHEMINF covers information about chemical entities and includes terms for the descriptors commonly used in cheminformatics software applications and for the algorithms that generate them.

  • BioAssay Ontology (BAO)

    The BioAssay Ontology (BAO) describes biological screening assays and their results, including high-throughput screening (HTS) data, for the purpose of categorising assays and analysing data. BAO is an extensible, knowledge-based, highly expressive description of biological assays [1] that makes use of the description-logic-based features of the Web Ontology Language (OWL).

  • Ontology of units of Measure (OM) http://www.ontology-of-units-of-measure.org/resource/om-2

    The OM ontology provides classes, instances, and properties that represent the different concepts used for defining and using measures and units. It includes, for instance, common units such as the SI units meter and kilogram, and a wide range of units relevant to chemistry and related fields. It can be mapped to other resources, such as the Unit Ontology, with tools such as OxO.
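Recording an explicit unit with every result is what makes values comparable across datasets. As a minimal sketch, assuming results are reported in molar concentration units, values can be normalised to nanomolar and converted to a negative log scale such as pIC50. The unit labels below are plain illustrative strings, not OM identifiers, and the function names are hypothetical.

```python
import math

# Multipliers that convert a value in the given unit to nanomolar (nM).
# The keys are illustrative labels, not OM ontology identifiers.
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a molar concentration to nanomolar."""
    if unit not in TO_NANOMOLAR:
        raise ValueError(f"unsupported concentration unit: {unit!r}")
    return value * TO_NANOMOLAR[unit]


def negative_log_molar(value: float, unit: str) -> float:
    """Return -log10 of the concentration in mol/L (e.g. pIC50 from IC50)."""
    nanomolar = to_nanomolar(value, unit)
    if nanomolar <= 0:
        raise ValueError("concentration must be positive")
    # 1 M = 1e9 nM, so -log10(molar) = 9 - log10(nanomolar).
    return 9.0 - math.log10(nanomolar)


# An IC50 of 1.5 uM corresponds to 1500 nM.
print(to_nanomolar(1.5, "uM"))
print(round(negative_log_molar(1.5, "uM"), 2))
```

Rejecting unknown units, rather than guessing, keeps ratio or percentage results from being silently treated as concentrations.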

More information on annotating data with ontologies, using tools like Zooma, can be found in Section of this recipe.

Exemplar Bioactivity datasets

SARS-CoV-2 phenotypic assay on the Caco-2 cell line

The present dataset is a subset of the IMI CARE dataset, with compounds tested on the Caco-2 cell line. The dataset can be downloaded and, besides structural information, it contains activity readouts: either the percentage of cellular cytopathic inhibition at a given concentration or the corresponding IC50 (half-maximal inhibitory concentration) extracted from the dose-response curve.
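The relationship between the two readout types can be illustrated with a small sketch that estimates an IC50 from percent-inhibition values measured at several concentrations. It uses simple linear interpolation on the log-concentration scale, which is an assumption made for brevity; it is not the method used to produce the dataset, and production pipelines typically fit a sigmoidal dose-response model instead. The function name and the numbers are hypothetical.

```python
import math
from typing import Optional, Sequence


def interpolate_ic50(concentrations: Sequence[float],
                     inhibitions: Sequence[float]) -> Optional[float]:
    """Estimate IC50 by linear interpolation on log10(concentration).

    concentrations: positive values in ascending order, all in the same unit
    inhibitions: percent inhibition measured at each concentration
    Returns None when no adjacent pair of points brackets 50 % inhibition.
    """
    points = list(zip(concentrations, inhibitions))
    for (c_lo, y_lo), (c_hi, y_hi) in zip(points, points[1:]):
        if y_lo < 50.0 <= y_hi:
            # Fraction of the way from the lower to the upper point.
            fraction = (50.0 - y_lo) / (y_hi - y_lo)
            log_lo, log_hi = math.log10(c_lo), math.log10(c_hi)
            return 10 ** (log_lo + fraction * (log_hi - log_lo))
    return None


# Hypothetical dose-response series (concentrations in uM).
doses = [0.1, 1.0, 10.0, 100.0]
percent_inhibition = [5.0, 20.0, 80.0, 95.0]
print(interpolate_ic50(doses, percent_inhibition))
```

Returning `None` instead of extrapolating makes it explicit when the tested range does not cover the half-maximal point, in which case only the percentage values should be reported.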

The recommendations above are based on ChEMBL ontology requirements. PubChem, the US counterpart to ChEMBL, has different ontology requirements for upload but provides a wizard-based upload process, described in this blog.

Glossary

Biochemical Assay, Cellular Activity Assay, Cellular Toxicity Assay

Quantitative measurements of a biophysical event followed by an assay (e.g. change in fluorescence)

EC50: Half maximal Effective Concentration

IC50: Half maximal Inhibitory Concentration

AC50: Half maximal Activation Concentration

CC50: Half maximal Cytotoxic Concentration