1.3. Main Content
To make datasets more findable and databases more interoperable, it is recommended to share not just top-level metadata, but also record-level metadata. For example, for chemical databases, one may share the chemical compound identifiers as open data, while keeping the data itself in their own database. That allows other resources, like Wikidata, to link to a private database. First, that makes such data more findable, but it also makes the same data easier to integrate, and therefore more interoperable.
This recipe follows this idea and starts with a CC0-licensed mapping file, linking InChIKeys and database identifiers. Recipe InChI and SMILES identifiers for chemical structures (fcb:FCB007) explains how InChIKeys can be generated for chemical compounds in your database.
As an example, this recipe demonstrates the approach of listing SwissLipids [1] identifiers in Wikidata, developed at the 2021 BioHackathon Europe. A SwissLipids property for Wikidata had been proposed earlier, was later approved, and was created in Wikidata just before the 2020 BioHackathon Europe. An existing Wikidata property is a prerequisite for adding the external identifiers.
The recipe uses Bacting [3], which can be used from Groovy and Python.
We will use Groovy here. The starting point is a set of mappings (tuples) that link the SwissLipids identifiers to InChIKeys. The InChIKeys are then used to find the matching Wikidata items.
1.3.1. Step 1: getting the data
The SwissLipids data is licensed under CC-BY-4.0, but because Wikidata is CC0, additional permission was needed. Various strategies can be used here, for example asking the provider:
to make a subset with identifier mappings available under CC0 (DrugBank uses this approach)
to grant permission to release the mappings to Wikidata (effectively making them CC0)
For SwissLipids an email exchange with the SIB gave me permission to add the SwissLipids identifiers to Wikidata.
With the licensing issue resolved, the following practical steps were taken:
Download lipids.tsv from the Downloads page
Gunzip the file
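The decompression step can also be scripted. The following is a minimal sketch using only the Python standard library. The file names are assumptions: the download is taken to be named lipids.tsv.gz, and the output is written as swisslipids.tsv only so that it matches the file name used by the commands in Step 2.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src: Path, dest: Path) -> Path:
    """Decompress the gzip file `src` into `dest` and return `dest`."""
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as plain:
        # Stream the content so large files are not loaded into memory at once.
        shutil.copyfileobj(compressed, plain)
    return dest


if __name__ == "__main__":
    # Assumed names (not confirmed by the recipe): adjust to the actual download.
    source = Path("lipids.tsv.gz")
    if source.exists():
        gunzip(source, Path("swisslipids.tsv"))
```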
1.3.2. Step 2: extract SwissLipids ID <> InChIKey tuples
For this step, use csvtool (apt-get install csvtool):
csvtool -t TAB col 1,11 swisslipids.tsv
The output needs some further cleanup, such as removing lines without an InChIKey or with “none” or “-” as the value. The “InChIKey=” prefix is also removed in preparation for the next step. The full command is:
csvtool -t TAB col 1,11 swisslipids.tsv | sed 's/InChIKey=//' | grep -v "none" | grep -v ",-$" | grep -v ",$" | tee swisslipids_ids.tsv
This results in almost 600k tuples:
$ wc -l swiss*tsv
592412 swisslipids_ids.tsv
777957 swisslipids.tsv
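For readers who prefer to stay in Python, the same extraction and cleanup can be approximated as follows. This is a sketch, not the code used for the recipe: the function names are hypothetical, and it assumes, as the csvtool command does, that column 1 holds the SwissLipids identifier and column 11 the InChIKey. It differs slightly from the shell pipeline, which drops any line containing “none” anywhere, whereas this sketch inspects only the InChIKey column.

```python
import csv
from typing import Iterable, Iterator, List, Tuple

PREFIX = "InChIKey="
MISSING = {"", "-", "none"}  # values treated as "no InChIKey available"


def extract_tuples(rows: Iterable[List[str]]) -> Iterator[Tuple[str, str]]:
    """Yield (SwissLipids ID, InChIKey) pairs from parsed TSV rows."""
    for row in rows:
        if len(row) < 11:
            continue  # row too short to have an 11th column
        lipid_id = row[0].strip()
        key = row[10].strip()
        if key.startswith(PREFIX):
            key = key[len(PREFIX):]  # drop the "InChIKey=" prefix
        if key in MISSING:
            continue
        yield lipid_id, key


def read_tuples(path: str) -> List[Tuple[str, str]]:
    """Read a tab-separated file and return the cleaned tuples.

    A header row, if present, is not treated specially and passes through
    unless its 11th column is empty, "-" or "none".
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(extract_tuples(csv.reader(handle, delimiter="\t")))
```

Calling `read_tuples("swisslipids.tsv")` would return the list of pairs, which can then be written out or passed directly to the matching step.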
1.3.3. Step 3: creating a ShEx model
Here the task is to create a shape expression for Wikidata that models how the identifiers will be added to Wikidata. See the publication by Waagmeester et al., “A protocol for adding knowledge to Wikidata: aligning resources on human coronaviruses” [2].
1.3.4. Step 4: creating QuickStatements
Now that we have the mappings and the data model in Wikidata, we can create QuickStatements to enter the data into Wikidata. This is not the only approach, and the process can be further automated using “Wikidata bots”. For this, see BioHackathon Europe Project 32: Connecting ELIXIR-related open data on Wikidata via WikiProject ELIXIR.
Based on existing Bacting scripts, a script was created that takes the swisslipids_ids.tsv file as input and creates QuickStatements: https://github.com/egonw/ons-wikidata/blob/master/ExtIdentifiers/swisslipids.groovy
This script uses Apache Groovy, but Bacting can also be used in Python; see https://github.com/cthoyt/pybacting.
Note
This script uses a federated query against https://beta.sparql.swisslipids.org/sparql/ after a suggestion by Dr Jerven Bolleman who indicated that the RDF4J backing this SPARQL endpoint will automatically batch the query against Wikidata, overcoming limitations of the Wikidata Query Service:
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?wd ?key ?value WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    SELECT (substr(str(?compound),32) as ?wd) ?key ?value WHERE {
      ?compound wdt:P235 ?key .
      OPTIONAL { ?compound wdt:P8691 ?value . }
    }
  }
}
The script creates a file that looks like:
Q76738581 P8691 "SLM:000163948" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76865359 P8691 "SLM:000163954" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76865370 P8691 "SLM:000163964" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76866423 P8691 "SLM:000163966" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76865004 P8691 "SLM:000163968" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76733356 P8691 "SLM:000164019" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76733312 P8691 "SLM:000164023" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76737210 P8691 "SLM:000164026" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76736881 P8691 "SLM:000164032" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
Q76735022 P8691 "SLM:000164034" S248 Q41165322 S854 "https://www.swisslipids.org/#/downloads" S813 +2021-11-06T00:00:00Z/11
This resulted in about 17.5 thousand mappings, all based on exact InChIKey matches.
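The matching and formatting logic can be sketched in Python as follows. This is not the Groovy script linked above, only an illustration under stated assumptions: the function names are hypothetical, the query results are assumed to be available in the standard SPARQL JSON results format with the variables wd, key and value from the federated query, and fields are joined with tabs as in the QuickStatements version 1 syntax. Skipping items that already carry a P8691 value is one plausible use of the OPTIONAL clause, not a confirmed behaviour of the original script. The property, reference items and retrieval date are copied from the sample output.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

SOURCE_URL = "https://www.swisslipids.org/#/downloads"
RETRIEVED = "+2021-11-06T00:00:00Z/11"  # date taken from the sample output


def index_bindings(results: dict) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map InChIKey -> (Wikidata QID, existing SwissLipids ID or None).

    Simplification: if several items share an InChIKey, the last one wins.
    """
    index = {}
    for binding in results["results"]["bindings"]:
        existing = binding["value"]["value"] if "value" in binding else None
        index[binding["key"]["value"]] = (binding["wd"]["value"], existing)
    return index


def quickstatement(qid: str, lipid_id: str) -> str:
    """Format one statement with its reference (S248, S854, S813)."""
    fields = [qid, "P8691", f'"{lipid_id}"',
              "S248", "Q41165322",
              "S854", f'"{SOURCE_URL}"',
              "S813", RETRIEVED]
    return "\t".join(fields)


def new_statements(tuples: Iterable[Tuple[str, str]],
                   index: Dict[str, Tuple[str, Optional[str]]]) -> Iterator[str]:
    """Yield QuickStatements for exact InChIKey matches lacking a P8691 value."""
    for lipid_id, inchikey in tuples:
        match = index.get(inchikey)
        if match is None:
            continue  # no Wikidata item with this InChIKey
        qid, existing = match
        if existing is not None:
            continue  # assumption: item already has a SwissLipids ID
        yield quickstatement(qid, lipid_id)
```

Writing the yielded lines to a text file, one per line, gives input that can be pasted into QuickStatements in the next step.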
1.3.5. Step 5: running the QuickStatements
The resulting mappings are being added via the QuickStatements website.