Packaging ISA as a Research Object (RO) - Dataset Maturity Level 4
Abstract:
The goal of this tutorial is to show how to package a dataset as a minimal Research Object crate (RO-Crate). In this example, the dataset consists of an ISA JSON-LD document, the associated raw data files, and a computational workflow available as a CWL file.
To do so, we will be using:
the Python ISA-API
the Python ro-crate.py library which, given its alpha status, supports only a subset of the Research Object Crate specifications.
Let's get started by importing all the necessary modules:
import os
import json
import datetime
import isatools
import uuid
import hashlib
from json import load
from rocrate.rocrate import ROCrate
from rocrate.model.person import Person
from rocrate.model.dataset import Dataset
from rocrate.model.softwareapplication import SoftwareApplication
from rocrate.model.computationalworkflow import ComputationalWorkflow
from rocrate.model.computerlanguage import ComputerLanguage
from rocrate import rocrate_api
Packaging the various ISA serializations (Tab, JSON, JSON-LD) as a Research Object Crate
With the previous notebooks (recipes FCBXY1 and FCBXY2), we generated several distinct ISA documents:
a basic ISA-Tab descriptor.
a more completely described ISA-JSON descriptor, meeting community metadata annotation requirements.
a semantically typed ISA JSON-LD descriptor, which is an RDF serialization of the same information.
Here, we will be using the RDF serialization, the associated raw data files (dummy FASTQ files), and a computational workflow available as a CWL file.
1. Instantiating a Research Object and providing basic metadata
ontology = "obo"
a_crate_for_isa = ROCrate()
# a_crate_for_isa.id = "#research_object/" + str(ro_id)
a_crate_for_isa.name = "ISA JSON-LD representation of BII-S-3"
a_crate_for_isa.description = "ISA study serialized as JSON-LD using " + ontology + " ontology mapping"
a_crate_for_isa.keywords = ["ISA", "JSON-LD"]
a_crate_for_isa.license = "https://creativecommons.org/licenses/by/4.0/"
# a_crate_for_isa.creator = Person(a_crate_for_isa, "https://www.orcid.org/0000-0001-9853-5668", {"name": "Philippe Rocca-Serra"})
test = a_crate_for_isa.add()
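For orientation, the metadata set above ends up in a JSON-LD document whose layout is defined by the RO-Crate 1.1 specification: a metadata descriptor that points at a root Dataset. The sketch below builds that layout with plain dictionaries and without the ro-crate library. The helper name build_root_graph is hypothetical, and the exact output of the library may differ from this simplified picture.

```python
import json


def build_root_graph(name, description, keywords, license_url):
    """Return a minimal RO-Crate 1.1 style JSON-LD document as a dict.

    Hypothetical helper for illustration only; it is not part of ro-crate-py.
    """
    # The metadata descriptor identifies the file itself and points at the root.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root Dataset carries the crate-level metadata.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "keywords": keywords,
        "license": {"@id": license_url},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root],
    }


ontology = "obo"
doc = build_root_graph(
    name="ISA JSON-LD representation of BII-S-3",
    description="ISA study serialized as JSON-LD using " + ontology + " ontology mapping",
    keywords=["ISA", "JSON-LD"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(json.dumps(doc, indent=4))
```

Keeping the license as a reference ({"@id": ...}) rather than a bare string lets consumers resolve it as a linked entity.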
2. Improving Reusability by setting a license for the RO-Crate.
a_crate_for_isa.license = "https://creativecommons.org/licenses/by/4.0/"
3. Allowing proper credit by associating authors and creators with a globally unique identifier.
In this case, we show how to do so with an ORCID, by building a Person object and assigning it to the creator property of the RO-Crate object:
a_crate_for_isa.creator = Person(a_crate_for_isa,"https://www.orcid.org/0000-0001-9853-5668")
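Since credit depends on the identifier being correct, it can be worth checking an ORCID before storing it. ORCID iDs end with a check character computed with the ISO 7064 MOD 11-2 algorithm. The following is a minimal sketch of that check; the function name orcid_checksum_ok is hypothetical and is not part of the ro-crate library.

```python
def orcid_checksum_ok(orcid_url):
    """Return True if the ORCID iD in the URL has a valid check character.

    Hypothetical helper: implements the ISO 7064 MOD 11-2 check used by ORCID.
    """
    # Keep only the last path segment and drop the hyphens.
    bare = orcid_url.rstrip("/").rsplit("/", 1)[-1].replace("-", "")
    if len(bare) != 16 or not all(ch in "0123456789" for ch in bare[:15]):
        return False
    total = 0
    for ch in bare[:15]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return bare[15].upper() == expected


print(orcid_checksum_ok("https://www.orcid.org/0000-0001-9853-5668"))
```

A check like this only catches typing errors; it does not confirm that the identifier is registered or belongs to the intended person.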
4. Adding two ISA RDF serializations to the newly created Research Object crate.
# instance_path = os.path.join("./output/BII-S-3-synth/", "isa-new_ids.json")
#
# with open(instance_path, 'r') as instance_file:
# instance = load(instance_file)
# instance_file.close()
isa_json_ld_path = os.path.join("./output/BII-S-3-synth/", "isa-new_ids-BII-S-3-ld-" + ontology + "-v1.json")
isa_nquads_path = os.path.join("./output/BII-S-3-synth/", "isa.ttl")
files = [isa_json_ld_path, isa_nquads_path ]
# add each file to the crate in turn:
for file in files:
    a_crate_for_isa.add_file(file)
5. Now adding a dataset to the Research Object, which is meant to describe a bag of associated images.
ds = Dataset(a_crate_for_isa, "raw_images")
ds.format_id="http://edamontology.org/format_3604"
ds.datePublished=datetime.datetime.now()
ds.as_jsonld=isa_json_ld_path
a_crate_for_isa.add(ds)
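One point to keep in mind with dates: JSON has no native date type, and the RO-Crate specification expects datePublished to be an ISO 8601 string. The sketch below shows the conversion with plain dictionaries. The entity and the timestamp are illustrative assumptions, and whether the ro-crate library performs this conversion itself is not covered here.

```python
import datetime
import json

# Illustrative, fixed timestamp (an assumption, not a value from the recipe).
stamp = datetime.datetime(2021, 3, 1, 12, 30, tzinfo=datetime.timezone.utc)

entity = {
    "@id": "raw_images/",
    "@type": "Dataset",
    # isoformat() turns the datetime into an ISO 8601 string that JSON can hold.
    "datePublished": stamp.isoformat(),
}

encoded = json.dumps(entity, indent=4)
print(encoded)
```

Passing the datetime object itself to json.dumps would raise a TypeError, which is why the string conversion happens first.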
6. Next, we create a Computational Workflow object and we add it to the Research Object
Tip: Note that the Computational Workflow may also be represented as an ISA Protocol object.
wf = ComputationalWorkflow(a_crate_for_isa, "metagenomics-sequence-analysis.cwl")
wf.language="http://edamontology.org/format_3857"
wf.datePublished=datetime.datetime.now()
# read the workflow file and record its SHA-256 checksum
with open("metagenomics-sequence-analysis.cwl", "rb") as f:
    workflow_bytes = f.read()
new_hash = hashlib.sha256(workflow_bytes).hexdigest()
wf.hash = new_hash
a_crate_for_isa.add(wf)
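Reading the whole file into memory is fine for a small CWL document, but the same checksum can be computed in chunks, which also suits large raw data files. The following sketch uses only the standard library; the helper name sha256_of_file and the demo file are assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file standing in for the workflow document.
payload = b"cwlVersion: v1.0\n" * 10000
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.cwl")
    with open(demo_path, "wb") as out:
        out.write(payload)
    demo_hash = sha256_of_file(demo_path, chunk_size=1024)

print(demo_hash)
```

The chunked digest is identical to the one obtained by hashing the full byte string at once, so either approach can be used to fill the hash value.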
7. Finally, we write the Research Object to file
ro_outpath = "./output/BII-S-3-synth/ISA_in_a_ROcrate"
a_crate_for_isa.write_crate(ro_outpath)
with open(os.path.join(ro_outpath, "ro-crate-metadata.json"), 'r') as handle:
    # print(handle)
    parsed = json.load(handle)
print(json.dumps(parsed, indent=4, sort_keys=True))
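The printed JSON can be long, so a compact summary of which entities the crate describes may be easier to review. The sketch below groups entity identifiers by their @type, working on any parsed RO-Crate metadata dictionary. The helper name entities_by_type and the sample graph are illustrative assumptions, not output produced by this recipe.

```python
from collections import defaultdict


def entities_by_type(metadata):
    """Group the @id values of a parsed RO-Crate @graph by @type."""
    index = defaultdict(list)
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", [])
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        for type_name in types:
            index[type_name].append(entity.get("@id"))
    return dict(index)


# Illustrative sample, shaped like an RO-Crate metadata file.
sample = {
    "@graph": [
        {"@id": "ro-crate-metadata.json", "@type": "CreativeWork"},
        {"@id": "./", "@type": "Dataset"},
        {"@id": "raw_images/", "@type": "Dataset"},
        {
            "@id": "metagenomics-sequence-analysis.cwl",
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        },
    ]
}

summary = entities_by_type(sample)
for type_name, ids in sorted(summary.items()):
    print(type_name, ids)
```

In the recipe itself, the parsed dictionary loaded from the metadata file would take the place of the sample.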
8. Alternatively, a zipped archive can be created as follows:
a_crate_for_isa.write_zip(ro_outpath)
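Once an archive exists, its contents can be listed with the standard zipfile module to confirm that the metadata file travelled with the data. The sketch below runs on a small stand-in archive built on the fly; the helper name list_archive and the demo file names are assumptions, and the actual archive path produced by write_zip should be substituted when checking a real crate.

```python
import os
import tempfile
import zipfile


def list_archive(zip_path):
    """Return the sorted member names of a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Build a tiny stand-in archive so the sketch runs without the ro-crate library.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo_crate.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("ro-crate-metadata.json", "{}")
        archive.writestr("isa.ttl", "")
    members = list_archive(demo_zip)

print(members)
```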
Et voilà!
Conclusion:
With this recipe, we have briefly introduced RO-Crate as a mechanism to package data together with its associated
metadata, using a Python library that offers a minimal, initial implementation of the specifications.
The current iteration of the Python library presents certain limitations. For instance, it does not provide the
necessary functionality to allow recording of Provenance information. However, this can be easily accomplished by
extending the code.
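As an indication of what such an extension might involve, the RO-Crate specification describes provenance with schema.org CreateAction entities that link an instrument (for example a workflow) to the objects it consumed and the results it produced. The sketch below appends such an entity to a parsed metadata dictionary. The helper name add_create_action and all identifiers are hypothetical, and this is not functionality of the ro-crate library.

```python
import datetime


def add_create_action(metadata, action_id, instrument_id, object_ids, result_ids):
    """Append a CreateAction entity to an RO-Crate @graph and return it.

    Hypothetical helper sketching one possible provenance record.
    """
    action = {
        "@id": action_id,
        "@type": "CreateAction",
        "endTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "instrument": {"@id": instrument_id},          # what performed the action
        "object": [{"@id": i} for i in object_ids],    # inputs
        "result": [{"@id": i} for i in result_ids],    # outputs
    }
    metadata.setdefault("@graph", []).append(action)
    return action


# Illustrative identifiers only.
crate_metadata = {"@graph": [{"@id": "./", "@type": "Dataset"}]}
record = add_create_action(
    crate_metadata,
    action_id="#run-1",
    instrument_id="metagenomics-sequence-analysis.cwl",
    object_ids=["raw_images/"],
    result_ids=["isa.ttl"],
)
print(record["@type"], len(crate_metadata["@graph"]))
```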
The key message behind this recipe is simply that RO-Crate improves on zipping a bunch of files
together by providing some semantics about the different parts making up an archive.
Also, it is important to bear in mind that the Research Object crate is nascent and more work is needed to define
best practices and implementation profiles.
What to read next?
What is Provenance information?
Upload to Zenodo and get a DOI
How to make a workflow FAIR?
Authors
Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution |
---|---|---|---|---|---|
| | University of Oxford | | | Writing - Original Draft |
| | University of Oxford | | | Writing - Original Draft |