Starting material¶
Background:¶
Experimental results, such as the metabolite profiling data published in [1,2], can be straightforwardly reported using OKFN Data Packages. Such packages can be easily parsed as data frames and exploited for data visualization using libraries implementing grammar-of-graphics concepts. Here, we show how to use a set of Python libraries to create a Tabular Data Package from an Excel file, annotate it with ontologies (ChEBI, PO, NCBITaxon) and validate the result against the JSON definition of the data table. A few lines of code suffice to structure information around key study design descriptors: the independent variables and their levels are clearly and unambiguously declared in the Tabular Data Package itself.
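For orientation, a Tabular Data Package is essentially a JSON descriptor pointing at one or more CSV resources, each carrying a table schema. The minimal sketch below is illustrative only; the package name, resource path and fields are made up for this example and are not the descriptor used later in this notebook:
from datapackage import Package

# A minimal, hypothetical Tabular Data Package descriptor: one CSV resource
# with a partial table schema declaring three typed fields.
descriptor = {
    "name": "rose-aroma-example",
    "resources": [{
        "name": "treatment-group-means",
        "path": "rose-aroma-treatment-group-mean-sem.csv",
        "schema": {
            "fields": [
                {"name": "chemical_name", "type": "string"},
                {"name": "treatment", "type": "string"},
                {"name": "sample_mean", "type": "number"}
            ]
        }
    }]
}
print(Package(descriptor).valid)  # True if the descriptor complies with the specification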
1. Let’s begin by importing the Python packages that allow easy access to, and use of, data formatted as a JSON Data Package¶
import os
import libchebipy
import re
import pandas as pd
from datapackage import Package
from goodtables import validate
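These imports assume the packages are already installed. If they are not, a typical notebook install cell would look like the following (an assumption about the environment; depending on the pandas version, xlrd or openpyxl may be needed for Excel support):
# Hypothetical install cell; versions are deliberately not pinned here.
!pip install pandas libchebipy datapackage goodtables xlrd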
2. Reading the data¶
We now simply read in the Excel file corresponding to the Nature Genetics Supplementary Table from the Zenodo archive
(DOI: https://doi.org/10.5281/zenodo.2598799)
#df = pd.read_excel('Supplementary Data 3.xlsx', sheet_name='Feuil1')
df = pd.read_excel('https://zenodo.org/api/files/91a610cb-8f1f-4ec5-9818-767a75a7a820/Supplementary%20Data%203.xlsx', sheet_name='Feuil1')
df.head(25)
3. Following a manual inspection of the Excel source to find the start row of the data, we use the Pandas take() function¶
first to extract a row of headers (hence axis set to 0)
header_treatment = df.take([13], axis=0)
4. We then extract all the columns of interest (same take() function, with -axis set to 1)¶
data_full = df.take(list(range(3, 20)), axis=1)
# We now trim the table by dropping the first 16 rows, which contain no information
data_slice = data_full.take(list(range(16, 77)), axis=0)
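The same selection can also be written as a single positional slice; this is an equivalent formulation of the two take() calls above (the .copy() avoids chained-assignment warnings in the steps that follow):
# Rows 16-76 and columns 3-19 in one step; identical to data_slice above.
data_slice = df.iloc[16:77, 3:20].copy()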
5. We now rename the automatically generated DataFrame column headers to something more meaningful¶
data_slice.rename(columns={"Unnamed: 3": "chemical_name",
                           "Unnamed: 4": "sample_mean_1",
                           "Unnamed: 5": "sem_1",
                           "Unnamed: 6": "sample_mean_2",
                           "Unnamed: 7": "sem_2",
                           "Unnamed: 8": "sample_mean_3",
                           "Unnamed: 9": "sem_3",
                           "Unnamed: 10": "sample_mean_4",
                           "Unnamed: 11": "sem_4",
                           "Unnamed: 12": "sample_mean_5",
                           "Unnamed: 13": "sem_5",
                           "Unnamed: 14": "sample_mean_6",
                           "Unnamed: 15": "sem_6",
                           "Unnamed: 16": "sample_mean_7",
                           "Unnamed: 17": "sem_7",
                           "Unnamed: 18": "sample_mean_8",
                           "Unnamed: 19": "sem_8"}, inplace=True)
6. We insert 2 new fields as placeholders for chemical information descriptors¶
We then reinitialize the DataFrame index so that row numbering starts at 0, not 16
data_slice.insert(loc=1, column='inchi', value='')
data_slice.insert(loc=2, column='chebi_identifier', value='')
data_slice = data_slice.reset_index(drop=True)
7. We use libchebipy to retrieve ChEBI identifiers and InChIs from a chemical name.¶
Note: in this call, we only retrieve values for which an exact match on the chemical name is found in ChEBI. The libchebipy API does not allow easy searching on synonyms, so we fail to retrieve all the relevant information; this is merely to showcase how to use libchebipy.
for i in range(len(data_slice)):
    hit = libchebipy.search(data_slice.loc[i, 'chemical_name'], True)
    if len(hit) > 0:
        print("HIT: ", data_slice.loc[i, 'chemical_name'], ":", hit[0].get_inchi(), "|", hit[0].get_id())
        data_slice.loc[i, 'inchi'] = hit[0].get_inchi()
        data_slice.loc[i, 'chebi_identifier'] = hit[0].get_id()
    else:
        print("Nothing found: ", data_slice.loc[i, 'chemical_name'])
        data_slice.loc[i, 'inchi'] = ''
        data_slice.loc[i, 'chebi_identifier'] = ''
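Should a broader lookup be wanted, libchebipy also supports non-exact searching by passing False as the second argument; whether this reliably catches synonyms is not guaranteed, so treat the sketch below as an illustration rather than a fix:
# Non-exact search returns a list of candidate ChebiEntity objects;
# inspect the top hits before trusting any of them.
hits = libchebipy.search('citronellol', False)
for candidate in hits[:3]:
    print(candidate.get_id(), candidate.get_name())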
8. The following steps perform the table transformation from a ‘wide’ layout to a ‘long’ one.¶
The ‘long’ layout is the one relied upon by Frictionless Tabular Data Packages and consumed by the R ggplot2 library and the Python plotnine library.
Step 1: prepare the stubnames by picking out all the different ‘dimensions’ measured for a given condition (i.e. the repeating fields with an incrementing numeric suffix) and stripping that suffix.
feature_models = [col for col in data_slice.columns if re.match("(sample_mean|sem)_[0-9]", col) is not None]
features = list(set([re.sub("_[0-9]", "", feature_model) for feature_model in feature_models]))
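As a quick sanity check (an optional addition, not in the original notebook), the stubnames can be printed; note that set() does not preserve order:
print(feature_models)  # ['sample_mean_1', 'sem_1', ..., 'sample_mean_8', 'sem_8']
print(features)        # should contain 'sample_mean' and 'sem', in either order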
Step 2: invoke the Pandas pd.wide_to_long() function to carry out the table transformation. See the Pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html and the excellent blog post: https://medium.com/@wangyuw/data-reshaping-with-pandas-explained-80b2f51f88d2
long_df = pd.wide_to_long(data_slice, i=['chemical_name'], j='treatment', stubnames=features, sep="_")
9. Apparently the MultiIndex returned by pd.wide_to_long() causes a mismatch in the field positions.¶
We work around this by writing the DataFrame to a temporary file and reading it back in again; not ideal, but it does the trick.
long_df.to_csv("long.txt", sep='\t', encoding='utf-8')
long_df_from_file = pd.read_csv("long.txt", sep="\t")
long_df_from_file.head()
10. We can then remove the temporary file.¶
try:
    os.remove("long.txt")
except IOError as e:
    print(e)
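As an aside, flattening the MultiIndex produced by pd.wide_to_long() with reset_index() would likely avoid the temporary file altogether; this variant is a sketch, not what the notebook above does:
# Round-trip-free alternative: turn the ['chemical_name', 'treatment']
# MultiIndex back into ordinary columns.
# long_df_from_file = long_df.reset_index()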
11. Insert a new field, ‘unit’, into the DataFrame at position 3 and set its value to empty.¶
long_df_from_file.insert(loc=3, column='unit', value='')
12. Add new fields for each of the independent variables and their associated URIs, copying values from the ‘treatment’ field.¶
long_df_from_file['var1_levels'] = long_df_from_file['treatment']
long_df_from_file['var1_uri'] = long_df_from_file['treatment']
long_df_from_file['var2_levels'] = long_df_from_file['treatment']
long_df_from_file['var2_uri'] = long_df_from_file['treatment']
# adding a new field for 'sample size' and setting the value to n=3
long_df_from_file['sample_size'] = 3
13. Mark up all factor values with ontology terms and their resolvable URIs.¶
This requires a manual mapping; better ways could be devised (one is sketched after this block).
long_df_from_file.loc[long_df_from_file['treatment'] == 1, 'treatment'] = 'R. chinensis \'Old Blush\' sepals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 1, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 1, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 1, 'var2_levels'] = 'sepals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 1, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009031'
long_df_from_file.loc[long_df_from_file['treatment'] == 2, 'treatment'] = 'R. chinensis \'Old Blush\' stamens'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 2, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 2, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 2, 'var2_levels'] = 'stamens'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 2, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009029'
long_df_from_file.loc[long_df_from_file['treatment'] == 3, 'treatment'] = 'R. chinensis \'Old Blush\' petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 3, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 3, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 3, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 3, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 4, 'treatment'] = 'R. gigantea petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 4, 'var1_levels'] = 'R. gigantea'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 4, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74650'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 4, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 4, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 5, 'treatment'] = 'R. damascena petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 5, 'var1_levels'] = 'R. damascena'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 5, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_3765'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 5, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 5, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 6, 'treatment'] = 'R. gallica petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 6, 'var1_levels'] = 'R. gallica'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 6, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74632'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 6, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 6, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 7, 'treatment'] = 'R. moschata petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 7, 'var1_levels'] = 'R. moschata'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 7, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74646'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 7, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 7, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 8, 'treatment'] = 'R. wichurana petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 8, 'var1_levels'] = 'R. wichurana'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 8, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_2094184'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 8, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 8, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
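One such better way, sketched below with the same values as above, keeps the code-to-term mapping in a single dictionary and applies it in one loop (running it after the block above is a harmless no-op, since the numeric codes have already been replaced):
# (treatment label, organism, organism URI, organ, organ URI) per treatment code.
obo = 'http://purl.obolibrary.org/obo/'
factor_map = {
    1: ("R. chinensis 'Old Blush' sepals", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'sepals', obo + 'PO_0009031'),
    2: ("R. chinensis 'Old Blush' stamens", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'stamens', obo + 'PO_0009029'),
    3: ("R. chinensis 'Old Blush' petals", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'petals', obo + 'PO_0009032'),
    4: ('R. gigantea petals', 'R. gigantea', obo + 'NCBITaxon_74650', 'petals', obo + 'PO_0009032'),
    5: ('R. damascena petals', 'R. damascena', obo + 'NCBITaxon_3765', 'petals', obo + 'PO_0009032'),
    6: ('R. gallica petals', 'R. gallica', obo + 'NCBITaxon_74632', 'petals', obo + 'PO_0009032'),
    7: ('R. moschata petals', 'R. moschata', obo + 'NCBITaxon_74646', 'petals', obo + 'PO_0009032'),
    8: ('R. wichurana petals', 'R. wichurana', obo + 'NCBITaxon_2094184', 'petals', obo + 'PO_0009032'),
}
for code, (label, organism, organism_uri, organ, organ_uri) in factor_map.items():
    mask = long_df_from_file['treatment'] == code
    long_df_from_file.loc[mask, ['var1_levels', 'var1_uri', 'var2_levels', 'var2_uri']] = [organism, organism_uri, organ, organ_uri]
    long_df_from_file.loc[mask, 'treatment'] = label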
14. Dealing with missing values:¶
We set empty values to zero for sample_mean and sem to enable calculations; to do this, we rely on the Pandas fillna() function.
long_df_from_file['sample_mean'] = long_df_from_file['sample_mean'].fillna(0)
long_df_from_file['sem'] = long_df_from_file['sem'].fillna(0)
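Because the values went through a CSV round-trip, it may also be prudent to coerce the two measurement columns to numbers explicitly; this is an extra precaution, not a step from the original notebook:
# errors='coerce' turns any non-numeric leftovers into NaN, which fillna(0) then zeroes.
long_df_from_file['sample_mean'] = pd.to_numeric(long_df_from_file['sample_mean'], errors='coerce').fillna(0)
long_df_from_file['sem'] = pd.to_numeric(long_df_from_file['sem'], errors='coerce').fillna(0)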
15. Reorganize the column order in the DataFrame to match the Frictionless Tabular Data Package layout.¶
This is done very easily in Pandas by passing the desired column order as a list.
long_df_from_file = long_df_from_file[['chemical_name', 'inchi', 'chebi_identifier', 'var1_levels', 'var1_uri',
                                       'var2_levels', 'var2_uri', 'treatment', 'sample_size', 'sample_mean',
                                       'unit', 'sem']]
long_df_from_file.head()
16. We are now ready to write the file to disk as a UTF-8 encoded, comma-delimited file with double-quoted values; we¶
also drop the DataFrame index from the output.
try:
    HOME = os.getcwd()
    output_dir = os.path.join(HOME, '../data/processed/denovo')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    os.chdir(output_dir)
    long_df_from_file.to_csv("rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv",
                             quoting=1,  # csv.QUOTE_ALL: double-quote every value
                             doublequote=True, sep=',',
                             encoding='utf-8', index=False)
except IOError as e:
    print(e)
17. The final step is to validate the output against the JSON Data Package specification, which is stored in the¶
JSON Tabular Data Package definition folder.
os.chdir('./../../../')
LOCAL = os.getcwd()
print("moving to directory: ", os.getcwd())
package_definition = os.path.join(LOCAL,'./rose-metabo-JSON-DP-validated/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json')
file_to_test = os.path.join(LOCAL,'../data/processed/denovo/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv')
print ("JSON data package definition:", package_definition)
print("csv file to evaluate:", file_to_test)
try:
pack = Package(package_definition)
pack.valid
pack.errors
for e in pack.errors:
print(e)
report = validate(file_to_test)
if report['valid']== True:
print("Success! \n")
print("\'"+file_to_test + "\'"+ " is a valid Frictionless Tabular Data Package\n" + "It complies with the 'rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json' definition\n")
else:
print("hmmm, something went wrong. Please, see the validation report for tracing the fault")
except IOError as e:
print(e)
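Note that validate(file_to_test) on its own checks the CSV for general structural problems; to check the resources against the schemas declared in the descriptor, goodtables can also validate the data package directly. The call below uses the goodtables 'datapackage' preset and is a sketch to be checked against the installed version's documentation:
# Validating the descriptor makes goodtables check each listed resource
# against its declared table schema, not just for generic CSV problems.
report = validate(package_definition, preset='datapackage')
print("data package report valid:", report['valid'])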
Conclusion¶
This concludes this notebook, which showed how to convert a metabolite profiling dataset from a publication into a FAIR data package. The other notebooks show how to visualize and plot the dataset, and also how to convert it to a semantic graph as a Linked Data representation, query it, and plot from the query results:
- 1-rose-metabolites-Python-analysis.ipynb
- 2-rose-metabolites-Python-RDF-querying-analysis.ipynb
- 3-rose-metabolites-R-analysis.ipynb (NB: requires making an R kernel available to Jupyter)
References¶
Authors¶
Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution
---|---|---|---|---|---
 | | University of Oxford | | | Writing - Original Draft
 | | University of Oxford | | | Writing - Review & Editing, Funding Acquisition