Starting material¶
Background:¶
Experimental results, such as the metabolite profiling data published in [1,2], can be straightforwardly reported using OKFN Data Packages. Such packages can be easily parsed as data frames and exploited for data visualization using libraries implementing grammar-of-graphics concepts. Here, we show how to use a set of Python libraries to create a Tabular Data Package from an Excel file, annotate it with ontologies (ChEBI, PO, NCBITaxon) and validate the result against the JSON definition of the data table. A few lines of code suffice to structure information around key study design descriptors: the independent variables and their levels are clearly and unambiguously declared in the Tabular Data Package itself.
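For orientation, a Tabular Data Package is essentially a JSON descriptor pointing at one or more CSV resources, each carrying a table schema. The minimal sketch below is illustrative only; the package name, resource path and fields are made up for this example and are not the descriptor used later in this notebook:
from datapackage import Package

# A minimal, hypothetical Tabular Data Package descriptor: one CSV resource
# with a partial table schema declaring three typed fields.
descriptor = {
    "name": "rose-aroma-example",
    "resources": [{
        "name": "treatment-group-means",
        "path": "rose-aroma-treatment-group-mean-sem.csv",
        "schema": {
            "fields": [
                {"name": "chemical_name", "type": "string"},
                {"name": "treatment", "type": "string"},
                {"name": "sample_mean", "type": "number"}
            ]
        }
    }]
}
print(Package(descriptor).valid)  # True if the descriptor complies with the specification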
1. Let’s begin by importing the Python packages that allow easy access to, and use of, data formatted as a JSON Data Package¶
import os
import libchebipy
import re
import pandas as pd
from datapackage import Package
from goodtables import validate
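These imports assume the packages are already installed. If they are not, a typical notebook install cell would look like the following (an assumption about the environment; depending on the pandas version, xlrd or openpyxl may be needed for Excel support):
# Hypothetical install cell; versions are deliberately not pinned here.
!pip install pandas libchebipy datapackage goodtables xlrd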
2. Reading the data¶
We now simply read in the Excel file corresponding to the Nature Genetics Supplementary Table from the Zenodo archive
(DOI: https://doi.org/10.5281/zenodo.2598799)
#df = pd.read_excel('Supplementary Data 3.xlsx', sheet_name='Feuil1')
df = pd.read_excel('https://zenodo.org/api/files/91a610cb-8f1f-4ec5-9818-767a75a7a820/Supplementary%20Data%203.xlsx', sheet_name='Feuil1')
df.head(25)
3. Following a manual inspection of the Excel source to find the start row of the data, we use the Pandas take() function¶
first to extract a row of headers (hence axis set to 0)
header_treatment = df.take([13], axis=0)
4. We then extract all the columns of interest (same take() function, with -axis set to 1)¶
data_full = df.take(list(range(3, 20)), axis=1)
# We now trim the table by dropping the first 16 rows, which contain no information
data_slice = data_full.take(list(range(16, 77)), axis=0)
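The same selection can also be written as a single positional slice; this is an equivalent formulation of the two take() calls above (the .copy() avoids chained-assignment warnings in the steps that follow):
# Rows 16-76 and columns 3-19 in one step; identical to data_slice above.
data_slice = df.iloc[16:77, 3:20].copy()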
5. We now rename the automatically generated DataFrame column headers to something more meaningful¶
data_slice.rename(columns={"Unnamed: 3": "chemical_name",
                           "Unnamed: 4": "sample_mean_1",
                           "Unnamed: 5": "sem_1",
                           "Unnamed: 6": "sample_mean_2",
                           "Unnamed: 7": "sem_2",
                           "Unnamed: 8": "sample_mean_3",
                           "Unnamed: 9": "sem_3",
                           "Unnamed: 10": "sample_mean_4",
                           "Unnamed: 11": "sem_4",
                           "Unnamed: 12": "sample_mean_5",
                           "Unnamed: 13": "sem_5",
                           "Unnamed: 14": "sample_mean_6",
                           "Unnamed: 15": "sem_6",
                           "Unnamed: 16": "sample_mean_7",
                           "Unnamed: 17": "sem_7",
                           "Unnamed: 18": "sample_mean_8",
                           "Unnamed: 19": "sem_8"}, inplace=True)
6. We insert 2 new fields as placeholders for chemical information descriptors¶
We then reinitialize the DataFrame index so that row numbering starts at 0, not 16
data_slice.insert(loc=1, column='inchi', value='')
data_slice.insert(loc=2, column='chebi_identifier', value='')
data_slice = data_slice.reset_index(drop=True)
7. We use libchebipy to retrieve ChEBI identifiers and InChIs from a chemical name.¶
Note: in this call, we only retrieve values for which an exact match on the chemical name is found in ChEBI. The libchebipy API does not allow easy searching on synonyms, so we fail to retrieve all the relevant information; this is merely to showcase how to use libchebipy.
for i in range(len(data_slice)):
    hit = libchebipy.search(data_slice.loc[i, 'chemical_name'], True)
    if len(hit) > 0:
        print("HIT: ", data_slice.loc[i, 'chemical_name'], ":", hit[0].get_inchi(), "|", hit[0].get_id())
        data_slice.loc[i, 'inchi'] = hit[0].get_inchi()
        data_slice.loc[i, 'chebi_identifier'] = hit[0].get_id()
    else:
        print("Nothing found: ", data_slice.loc[i, 'chemical_name'])
        data_slice.loc[i, 'inchi'] = ''
        data_slice.loc[i, 'chebi_identifier'] = ''
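Should a broader lookup be wanted, libchebipy also supports non-exact searching by passing False as the second argument; whether this reliably catches synonyms is not guaranteed, so treat the sketch below as an illustration rather than a fix:
# Non-exact search returns a list of candidate ChebiEntity objects;
# inspect the top hits before trusting any of them.
hits = libchebipy.search('citronellol', False)
for candidate in hits[:3]:
    print(candidate.get_id(), candidate.get_name())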
8. The following steps perform the table transformation from a ‘wide’ layout to a ‘long’ one.¶
The ‘long’ layout is the one relied upon by Frictionless Tabular Data Packages and consumed by the R ggplot2 library and the Python plotnine library.
Step 1: prepare the stubnames by picking out all the different ‘dimensions’ measured for a given condition (i.e. the repeating fields with an incrementing numeric suffix) and stripping that suffix.
feature_models = [col for col in data_slice.columns if re.match("(sample_mean|sem)_[0-9]", col) is not None]
features = list(set([re.sub("_[0-9]", "", feature_model) for feature_model in feature_models]))
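As a quick sanity check (an optional addition, not in the original notebook), the stubnames can be printed; note that set() does not preserve order:
print(feature_models)  # ['sample_mean_1', 'sem_1', ..., 'sample_mean_8', 'sem_8']
print(features)        # should contain 'sample_mean' and 'sem', in either order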
Step 2: invoke the Pandas pd.wide_to_long() function to carry out the table transformation. See the Pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html and the excellent blog post: https://medium.com/@wangyuw/data-reshaping-with-pandas-explained-80b2f51f88d2
long_df = pd.wide_to_long(data_slice, i=['chemical_name'], j='treatment', stubnames=features, sep="_")
9. Apparently the MultiIndex returned by pd.wide_to_long() causes a mismatch in the field positions.¶
We work around this by writing the DataFrame to a temporary file and reading it back in again; not ideal, but it does the trick.
long_df.to_csv("long.txt", sep='\t', encoding='utf-8')
long_df_from_file = pd.read_csv("long.txt", sep="\t")
long_df_from_file.head()
10. We can then remove the temporary file.¶
try:
    os.remove("long.txt")
except IOError as e:
    print(e)
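As an aside, flattening the MultiIndex produced by pd.wide_to_long() with reset_index() would likely avoid the temporary file altogether; this variant is a sketch, not what the notebook above does:
# Round-trip-free alternative: turn the ['chemical_name', 'treatment']
# MultiIndex back into ordinary columns.
# long_df_from_file = long_df.reset_index()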
11. Insert a new field, ‘unit’, into the DataFrame at position 3 and set its value to empty.¶
long_df_from_file.insert(loc=3, column='unit', value='')
12. Add new fields for each of the independent variables and their associated URIs, copying values from the ‘treatment’ field.¶
long_df_from_file['var1_levels'] = long_df_from_file['treatment']
long_df_from_file['var1_uri'] = long_df_from_file['treatment']
long_df_from_file['var2_levels'] = long_df_from_file['treatment']
long_df_from_file['var2_uri'] = long_df_from_file['treatment']
# adding a new field for 'sample size' and setting the value to n=3
long_df_from_file['sample_size'] = 3
13. Mark up all factor values with ontology terms and their resolvable URIs.¶
This requires a manual mapping; better ways could be devised (one is sketched after this block).
long_df_from_file.loc[long_df_from_file['treatment'] == 1, 'treatment'] = 'R. chinensis \'Old Blush\' sepals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 1, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 1, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 1, 'var2_levels'] = 'sepals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 1, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009031'
long_df_from_file.loc[long_df_from_file['treatment'] == 2, 'treatment'] = 'R. chinensis \'Old Blush\' stamens'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 2, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 2, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 2, 'var2_levels'] = 'stamens'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 2, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009029'
long_df_from_file.loc[long_df_from_file['treatment'] == 3, 'treatment'] = 'R. chinensis \'Old Blush\' petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 3, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 3, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 3, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 3, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 4, 'treatment'] = 'R. gigantea petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 4, 'var1_levels'] = 'R. gigantea'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 4, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74650'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 4, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 4, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 5, 'treatment'] = 'R. damascena petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 5, 'var1_levels'] = 'R. damascena'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 5, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_3765'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 5, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 5, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 6, 'treatment'] = 'R. gallica petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 6, 'var1_levels'] = 'R. gallica'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 6, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74632'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 6, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 6, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 7, 'treatment'] = 'R. moschata petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 7, 'var1_levels'] = 'R. moschata'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 7, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74646'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 7, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 7, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
long_df_from_file.loc[long_df_from_file['treatment'] == 8, 'treatment'] = 'R. wichurana petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 8, 'var1_levels'] = 'R. wichurana'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 8, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_2094184'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 8, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 8, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
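One such better way, sketched below with the same values as above, keeps the code-to-term mapping in a single dictionary and applies it in one loop (running it after the block above is a harmless no-op, since the numeric codes have already been replaced):
# (treatment label, organism, organism URI, organ, organ URI) per treatment code.
obo = 'http://purl.obolibrary.org/obo/'
factor_map = {
    1: ("R. chinensis 'Old Blush' sepals", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'sepals', obo + 'PO_0009031'),
    2: ("R. chinensis 'Old Blush' stamens", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'stamens', obo + 'PO_0009029'),
    3: ("R. chinensis 'Old Blush' petals", "R. chinensis 'Old Blush'", obo + 'NCBITaxon_74649', 'petals', obo + 'PO_0009032'),
    4: ('R. gigantea petals', 'R. gigantea', obo + 'NCBITaxon_74650', 'petals', obo + 'PO_0009032'),
    5: ('R. damascena petals', 'R. damascena', obo + 'NCBITaxon_3765', 'petals', obo + 'PO_0009032'),
    6: ('R. gallica petals', 'R. gallica', obo + 'NCBITaxon_74632', 'petals', obo + 'PO_0009032'),
    7: ('R. moschata petals', 'R. moschata', obo + 'NCBITaxon_74646', 'petals', obo + 'PO_0009032'),
    8: ('R. wichurana petals', 'R. wichurana', obo + 'NCBITaxon_2094184', 'petals', obo + 'PO_0009032'),
}
for code, (label, organism, organism_uri, organ, organ_uri) in factor_map.items():
    mask = long_df_from_file['treatment'] == code
    long_df_from_file.loc[mask, ['var1_levels', 'var1_uri', 'var2_levels', 'var2_uri']] = [organism, organism_uri, organ, organ_uri]
    long_df_from_file.loc[mask, 'treatment'] = label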
14. Dealing with missing values:¶
We set empty values to zero for sample_mean and sem to enable calculations; to do this, we rely on the Pandas fillna() function.
long_df_from_file['sample_mean'] = long_df_from_file['sample_mean'].fillna(0)
long_df_from_file['sem'] = long_df_from_file['sem'].fillna(0)
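Because the values went through a CSV round-trip, it may also be prudent to coerce the two measurement columns to numbers explicitly; this is an extra precaution, not a step from the original notebook:
# errors='coerce' turns any non-numeric leftovers into NaN, which fillna(0) then zeroes.
long_df_from_file['sample_mean'] = pd.to_numeric(long_df_from_file['sample_mean'], errors='coerce').fillna(0)
long_df_from_file['sem'] = pd.to_numeric(long_df_from_file['sem'], errors='coerce').fillna(0)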
15. Reorganize the column order in the DataFrame to match the Frictionless Tabular Data Package layout.¶
This is done very easily in Pandas by passing the desired column order as a list.
long_df_from_file = long_df_from_file[['chemical_name', 'inchi', 'chebi_identifier', 'var1_levels', 'var1_uri',
                                       'var2_levels', 'var2_uri', 'treatment', 'sample_size', 'sample_mean',
                                       'unit', 'sem']]
long_df_from_file.head()
16. We are now ready to write the file to disk as a UTF-8 encoded, comma-delimited file with double-quoted values; we¶
also drop the DataFrame index from the output.
try:
    HOME = os.getcwd()
    output_dir = os.path.join(HOME, '../data/processed/denovo')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    os.chdir(output_dir)
    long_df_from_file.to_csv("rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv",
                             quoting=1,  # csv.QUOTE_ALL: double-quote every value
                             doublequote=True, sep=',',
                             encoding='utf-8', index=False)
except IOError as e:
    print(e)
17. The final step is to validate the output against the JSON Data Package specification, which is stored in the¶
JSON Tabular Data Package definition folder.
os.chdir('./../../../')
LOCAL = os.getcwd()
print("moving to directory: ", os.getcwd())
package_definition = os.path.join(LOCAL,'./rose-metabo-JSON-DP-validated/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json')
file_to_test = os.path.join(LOCAL,'../data/processed/denovo/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv')
print ("JSON data package definition:", package_definition)
print("csv file to evaluate:", file_to_test)
try:
pack = Package(package_definition)
pack.valid
pack.errors
for e in pack.errors:
print(e)
report = validate(file_to_test)
if report['valid']== True:
print("Success! \n")
print("\'"+file_to_test + "\'"+ " is a valid Frictionless Tabular Data Package\n" + "It complies with the 'rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json' definition\n")
else:
print("hmmm, something went wrong. Please, see the validation report for tracing the fault")
except IOError as e:
print(e)
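Note that validate(file_to_test) on its own checks the CSV for general structural problems; to check the resources against the schemas declared in the descriptor, goodtables can also validate the data package directly. The call below uses the goodtables 'datapackage' preset and is a sketch to be checked against the installed version's documentation:
# Validating the descriptor makes goodtables check each listed resource
# against its declared table schema, not just for generic CSV problems.
report = validate(package_definition, preset='datapackage')
print("data package report valid:", report['valid'])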
Conclusion¶
This concludes this notebook, which showed how to convert a metabolite profiling dataset from a publication into a FAIR data package. The other notebooks show how to visualize and plot the dataset, and also how to convert it to a semantic graph as a Linked Data representation, query it, and plot from the query results:
- 1-rose-metabolites-Python-analysis.ipynb
- 2-rose-metabolites-Python-RDF-querying-analysis.ipynb
- 3-rose-metabolites-R-analysis.ipynb (NB: requires making an R kernel available to Jupyter)
References¶
Authors¶
Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution
---|---|---|---|---|---
 | | University of Oxford | | | Writing - Original Draft
 | | University of Oxford | | | Writing - Review & Editing, Funding Acquisition