5. Registering Datasets
5.1. Main Objectives
Useful datasets become more useful when they are easily found. The FAIR principles state that:
F1. (Meta)data are assigned a globally unique and persistent identifier
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
This recipe is about making the metadata of an already archived dataset more Findable by getting that metadata distributed into other databases:
Learn how to register a published dataset on Wikidata with the appropriate citation to increase its findability.
5.3. Main Content
Archiving a dataset can be an important step in making data FAIR (see Depositing in Zenodo generic repository, fcb:FCB009), but by sharing the metadata, you make the dataset even more Findable. Sharing the metadata in Wikidata has the advantage that the dataset can be linked to many other life sciences databases.
This recipe focuses on datasets with information about small chemical compounds, particularly datasets with an SDF file or spreadsheets containing chemical identifiers in the form of SMILES strings. The Wikidata registration process itself, which increases Findability, works for any dataset. The Wikidata:Introduction page describes the principal ideas and the data model, which will make this recipe easier to follow.
Importantly, registering a dataset in Wikidata allows it to be used as a reference to support other Wikidata statements (existing or newly created ones). For example, if the dataset contains a logP value, then this logP value can be associated with the compound in Wikidata through a statement, with the dataset itself as the reference.
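As a minimal sketch of what such a referenced statement could look like, the Python below builds a single line in the QuickStatements V1 tab-separated syntax, where a reference is written with an "S" prefix and P248 is the "stated in" property. The function name is hypothetical, the sandbox item Q4115189 stands in for a compound, the value 1.23 is made up, and P2993 is assumed to be the water/octanol partition coefficient property; check all of these against Wikidata before use.

```python
def statement_with_reference(subject_qid: str, property_id: str,
                             value: float, dataset_qid: str) -> str:
    """Build one QuickStatements V1 line for a quantity statement.

    The statement is sourced with "stated in" (P248) pointing at the
    dataset item. In the V1 syntax, reference properties are written
    with an "S" prefix instead of "P", hence "S248".
    """
    return "\t".join([subject_qid, property_id, str(value), "S248", dataset_qid])


if __name__ == "__main__":
    # Q4115189 is the Wikidata sandbox item, used here as a placeholder
    # for a compound; 1.23 is an invented logP value for illustration.
    # P2993 is an assumption for the logP property; verify it on Wikidata.
    print(statement_with_reference("Q4115189", "P2993", 1.23, "Q108653787"))
```

The resulting line can be pasted into the Quickstatements website in the same way as the statements generated later in this recipe.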
5.3.1. Finding Datasets
There are several ways to locate datasets to be registered into Wikidata. One needs to take into consideration the following two requirements: the dataset should have a compatible license and the dataset should have an identifier.
Examples of where open datasets with information about chemical compounds can be found include:
Of course, there are also general data search engines now that may help find the data you are looking for:
If you have a new dataset and would like to deposit it because it is not listed in any of the above resources, please check out the Depositing in Zenodo generic repository recipe.
5.3.2. Dataset Registration in Wikidata
After locating a dataset, check the availability of the details of the dataset. Essential attributes include:
the title of the dataset
the author(s) of the dataset
the publication date of the dataset
links to the dataset source.
You may need these later, during the Wikidata registration process.
If the dataset has a DOI, there is a good chance that Scholia [1] can help you register the dataset in Wikidata.
You can use the https://scholia.toolforge.org/doi/$DOI pattern (‘$DOI’ in the link must be replaced with the DOI of the dataset you are registering) to check whether your dataset is already listed in Wikidata. If not, this page will use the DOI to retrieve the associated metadata and translate it into Wikidata-compatible Quickstatements.
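The URL pattern can be filled in by hand, but a small helper avoids copy-and-paste mistakes. The following is a sketch only: the function name is hypothetical, the handling of "doi:" and resolver prefixes is a convenience added here, and percent-encoding unusual characters is an assumption about what the Scholia page accepts. The example DOI is the one of the eLife article in the reference list.

```python
from urllib.parse import quote

SCHOLIA_DOI_PREFIX = "https://scholia.toolforge.org/doi/"


def scholia_doi_url(doi: str) -> str:
    """Return the Scholia lookup URL for a DOI."""
    cleaned = doi.strip()
    # Accept DOIs pasted as resolver links or with a "doi:" prefix.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep the "/" between DOI prefix and suffix; escape anything unusual.
    return SCHOLIA_DOI_PREFIX + quote(cleaned, safe="/")


if __name__ == "__main__":
    print(scholia_doi_url("doi:10.7554/eLife.52614"))
```

Opening the printed URL in a browser shows either the existing Wikidata entry or the page that offers to create Quickstatements.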
5.3.3. Adding a dataset with Quickstatements
Quickstatements is a tool that acts as a layer on top of Wikidata. It was developed by Magnus Manske, a researcher in Cambridge, UK, to make it easier to edit Wikidata in an automated way. We use it here to automate the registration of a dataset in Wikidata. We first generate Quickstatements, which can then be added to Wikidata using the https://quickstatements.toolforge.org/ website. This is the step that requires a Wikidata account.
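To give an idea of what such Quickstatements are, the sketch below assembles a batch in the V1 tab-separated syntax from the essential attributes listed earlier (title, authors, publication date, DOI). It is an illustration, not the output Scholia actually produces: the class and function names are hypothetical, the example record is invented, and the identifiers used (P31 "instance of", Q1172284 "data set", P1476 "title", P356 "DOI", P577 "publication date", P2093 "author name string", P1545 "series ordinal") should be double-checked on Wikidata.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    title: str
    doi: str
    publication_date: str  # ISO date, YYYY-MM-DD
    authors: list[str] = field(default_factory=list)


def dataset_quickstatements(record: DatasetRecord) -> str:
    """Return a QuickStatements V1 batch that creates a dataset item.

    Titles or names containing double quotes are not handled here.
    """
    lines = [
        "CREATE",                                        # start a new item
        "LAST\tP31\tQ1172284",                           # instance of: data set
        f'LAST\tLen\t"{record.title}"',                  # English label
        f'LAST\tP1476\ten:"{record.title}"',             # title (monolingual text)
        f'LAST\tP356\t"{record.doi.upper()}"',           # DOI, stored in upper case
        f"LAST\tP577\t+{record.publication_date}T00:00:00Z/11",  # day precision
    ]
    for position, name in enumerate(record.authors, start=1):
        # Author as plain text, with the author position as a qualifier.
        lines.append(f'LAST\tP2093\t"{name}"\tP1545\t"{position}"')
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented example record; the DOI is a placeholder, not a real dataset.
    example = DatasetRecord(
        title="Example compound dataset",
        doi="10.1234/example",
        publication_date="2021-09-01",
        authors=["A. Author", "B. Author"],
    )
    print(dataset_quickstatements(example))
```

In practice the Scholia page generates the statements for you, so a script like this is mainly useful for understanding or adjusting what gets submitted.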
5.3.3.1. Step 3: Open the Scholia page
If your dataset is not already in Wikidata, use the Scholia page for its DOI (the https://scholia.toolforge.org/doi/$DOI pattern above) to create Quickstatements. The result should look something like this:
5.3.3.2. Step 4: Execute the Quickstatements
On the page from Step 3, click the blue “Submit to Quickstatements” button. This takes you to the Quickstatements website, which will look something like this:
After you click “Run”, Quickstatements starts making edits in Wikidata, and when done, it should look like this:
5.3.4. Optional: adding additional information
The result page from Step 4 will include a link to the newly created Wikidata item. It will have a Wikidata identifier starting with a “Q”, for example Q108653787.
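Once the item exists, it can also be found again by its DOI through the Wikidata Query Service. The sketch below only builds the SPARQL query and the request URL; it does not send the request. The function names are hypothetical, and it assumes that DOIs are stored in upper case under property P356, which is the usual convention but worth verifying for your item.

```python
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def doi_lookup_query(doi: str) -> str:
    """Return a SPARQL query that finds items with the given DOI (P356)."""
    # Escape backslashes and double quotes for the SPARQL string literal.
    escaped = doi.upper().replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?item WHERE {{ ?item wdt:P356 "{escaped}" }}'


def doi_lookup_url(doi: str) -> str:
    """Return a Wikidata Query Service URL that runs the lookup as JSON."""
    params = {"query": doi_lookup_query(doi), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    # DOI of the eLife article from the reference list, used as an example.
    print(doi_lookup_query("10.7554/eLife.52614"))
    print(doi_lookup_url("10.7554/eLife.52614"))
```

The printed URL can be opened in a browser or fetched with any HTTP client; an empty result means no item carries that DOI yet.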
Additional information that can be provided includes:
links to Wikidata items for authors using the P50 property for “author”
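Author links of this kind can also be expressed as Quickstatements. The following sketch produces one V1 line per author item, with the author position as a "series ordinal" (P1545) qualifier. The function name is hypothetical, the qualifier is optional, and the sandbox item Q4115189 is used as a stand-in for a real author item.

```python
def author_statements(dataset_qid: str, author_qids: list[str]) -> str:
    """Return QuickStatements V1 lines linking a dataset to author items.

    Each line adds an "author" (P50) statement and records the author
    position as a "series ordinal" (P1545) qualifier.
    """
    lines = [
        f'{dataset_qid}\tP50\t{author_qid}\tP1545\t"{position}"'
        for position, author_qid in enumerate(author_qids, start=1)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Q108653787 is the example item from this recipe; Q4115189 is the
    # Wikidata sandbox item, standing in for an author item here.
    print(author_statements("Q108653787", ["Q4115189"]))
```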
The Use Scholia and Wikidata to find scientific literature tutorial provides more information on how Wikidata uses keywords to further expose literature; the same approach works identically for other research output, such as datasets.
[1] Finn Årup Nielsen, Daniel Mietchen, and Egon Willighagen. Scholia, Scientometrics and Wikidata, pages 237–259. Volume 10577 of Lecture Notes in Computer Science. Springer International Publishing, 2017. URL: http://link.springer.com/10.1007/978-3-319-70407-4_36, doi:10.1007/978-3-319-70407-4_36.
[2] Andra Waagmeester, Gregory Stupp, Sebastian Burgstaller-Muehlbacher, Benjamin M Good, Malachi Griffith, Obi L Griffith, Kristina Hanspers, Henning Hermjakob, Toby S Hudson, Kevin Hybiske, et al. Wikidata as a knowledge graph for the life sciences. eLife, 9:e52614, Mar 2020. doi:10.7554/eLife.52614.
This page is released under the Creative Commons 4.0 BY license.