1. Unique, persistent identifiers¶
The FAIR principles, under the Findability and the Accessibility chapters respectively, state that:
F1. (Meta)data are assigned a globally unique and persistent identifier
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
1.1. Main Objectives¶
The main goals of this recipe are therefore:
To understand the purpose of globally unique and persistent identifiers and how they can be used to retrieve the associated (meta)data using a standardized communication protocol.
To explain how to generate globally unique identifiers, what IRIs are, and how they can be generated, retrieved and resolved.
Following from these principles, three key processes need to be explained:
1.1.1. Identifier minting¶
Tip
Identifier minting is fundamentally about the authority deciding identity. Examples of such authorities include:
the tax office
the HR department
the company registry
EMBL-EBI
1.1.2. URI construction¶
Tip
URI construction is fundamentally about scoping the authority. For example, should the web address be http://organization/people/1123 or http://organization/commercial/people/1123?
1.1.3. URI resolution¶
Tip
URI resolution is fundamentally about directing requests to the relevant identified entity.
The standard approach is to resolve an HTTP GET request, using content negotiation to choose between different representations of the resource.
All these key points will be developed in this recipe.
1.3. Minting identifiers¶
Tip
Identifier minting is fundamentally about the authority deciding identity.
Identifiers are used to tag, identify, find and retrieve entities which are part of a collection or a resource maintained by some organization. This organization is the authority which rules over that area of knowledge. The core assumption is that identifiers must be unique: they cannot be shared, and there is a one-to-one relation between the ‘identifier’ and the entity it identifies.
Within an isolated system, disconnected from any other system, there is no risk of identifier collision. However, two isolated systems can create local identifiers which are completely identical but denote completely different entities. In fact, this happens all the time.
Such identifiers are said to be locally unique: there is no guarantee that they are unique across all other existing systems, in other words, that they are globally unique.
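The collision scenario can be made concrete with a small sketch. The two systems, their records and the `hr:` / `registry:` scoping labels are invented for illustration; the point is only that a locally unique identifier becomes ambiguous once collections are combined, and that scoping it by its authority removes the ambiguity.

```python
# Two isolated systems (hypothetical data) mint the same local identifier.
hr_department = {"1123": "employee: Jane Doe"}
company_registry = {"1123": "company: Example Ltd"}

# The identifier is unique within each system, yet shared between them.
shared_ids = set(hr_department) & set(company_registry)
print(shared_ids)  # {'1123'}

# The same identifier denotes two different entities.
for local_id in shared_ids:
    print(local_id, "->", hr_department[local_id], "|", company_registry[local_id])

# Scoping each identifier with its authority removes the ambiguity.
merged = {f"hr:{key}": value for key, value in hr_department.items()}
merged.update({f"registry:{key}": value for key, value in company_registry.items()})
print(sorted(merged))  # ['hr:1123', 'registry:1123']
```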
1.4. How to produce globally unique identifiers?¶
There are 2 ways to produce non-resolvable, globally unique identifiers:
1.4.1. UUID based identifiers¶
With this approach, the notion of universally unique is a probabilistic one. The probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion. The likelihood of collision (generation of the exact same identifier) is extremely small but not zero. Therefore, with an ever increasing number of digital resources to index, collisions should not be ruled out.
According to the RFC 4122 specification, a UUID is an identifier that is unique across both space and time, with respect to the space of all UUIDs. Since a UUID has a fixed size and contains a time field, it is possible for values to roll over (around A.D. 3400, depending on the specific algorithm used). A UUID can be used for multiple purposes, from tagging objects with an extremely short lifetime to reliably identifying very persistent objects across a network.
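The one-in-a-billion figure can be checked with the usual birthday-bound approximation, p ≈ 1 − exp(−n² / (2·2^b)), where b is the number of random bits. The sketch below assumes b = 122 (the random bits in a version-4 UUID); the function name is illustrative.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate probability of at least one collision among n random IDs."""
    space = 2 ** random_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-(n * n) / (2 * space))


p = collision_probability(103 * 10**12)  # 103 trillion version-4 UUIDs
print(f"{p:.2e}")  # on the order of 1e-9, i.e. about one in a billion
```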
Note
Key facts about UUIDs:
no centralized authority is required to administer them
content independent, entirely disconnected from the entity they are used to identify
generation on demand can be completely automated
non-resolvable
completely semantic-free (opaque) identifiers
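Generation in Python
The following code snippet shows the generation of a UUID using the Python uuid module.
import uuid
# uuid4() returns a random (version 4) UUID; the value differs on every call
identifier = uuid.uuid4()
print(identifier)
The output has the following format. Note that this particular sample carries the version digit 1 (the first digit of the third group), which marks a time-based UUID as returned by uuid.uuid1(); a value produced by uuid.uuid4() has a 4 in that position.
5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0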
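The version of a UUID is encoded in the identifier itself, so it can be inspected after the fact. The following minimal sketch uses only the standard uuid module and parses the sample value shown above; the variable names are illustrative.

```python
import uuid

random_id = uuid.uuid4()      # version 4: generated from random bits
time_based_id = uuid.uuid1()  # version 1: generated from a timestamp and a node id
sample = uuid.UUID("5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0")

print(random_id.version)      # 4
print(time_based_id.version)  # 1
print(sample.version)         # 1
```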
1.4.2. Hash based identifiers¶
This approach uses 2 inputs:
a cryptographic hashing algorithm implemented as a software function
a digital resource (e.g. a file)
The approach generates an identifier by using all or some of the content of the digital resource as input to the cryptographic hashing function, which computes a string that is, for practical purposes, unique and therefore acts as a signature (or fingerprint) of the digital resource 9, 10. A number of algorithms can be used and some are already widely used, such as the Message Digest algorithm MD5 specified by RFC 1321 3, the Secure Hash Algorithm 1 (SHA-1), Secure Hash Algorithm 2 (e.g. SHA-256), Secure Hash Algorithm 3 (SHA-3) or BLAKE2b-256 (RFC 7693) 1. MD5 and SHA-1 are considered obsolete for security purposes. SHA-2 and SHA-3 are NIST standards, while BLAKE2 is specified in RFC 7693 and is not a NIST standard.
Note
Key fact about hash identifiers: It is not possible to reconstruct the original data from these hash strings. They are only fingerprints, and are typically used as building blocks in tasks such as:
message authentication
digital signature
public key encryption
1.4.2.1. Generation in Python¶
The following code snippet shows the generation of a hash for a string using the Python hashlib package:
import hashlib
# encode the text to bytes (str.encode() defaults to UTF-8)
message = "creating globally unique identifiers for FAIR data".encode()
# hash with MD5 (not recommended)
print("MD5:", hashlib.md5(message).hexdigest())
# hash with SHA-2 (256-bit and 512-bit digests)
print("SHA-256:", hashlib.sha256(message).hexdigest())
print("SHA-512:", hashlib.sha512(message).hexdigest())
# hash with SHA-3
print("SHA-3-256:", hashlib.sha3_256(message).hexdigest())
print("SHA-3-512:", hashlib.sha3_512(message).hexdigest())
# hash with BLAKE2 (256-bit BLAKE2s & 512-bit BLAKE2b)
print("BLAKE2s:", hashlib.blake2s(message).hexdigest())
print("BLAKE2b:", hashlib.blake2b(message).hexdigest())
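Since the digital resource is usually a file rather than a short string, the same idea can be sketched for file content. The helper below reads the file in chunks so that large files do not need to fit in memory; the function name, the chunk size and the throwaway demonstration file are illustrative choices.

```python
import hashlib
import os
import tempfile


def file_digest_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a temporary file with known content.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"creating globally unique identifiers for FAIR data")
    tmp_path = tmp.name

file_id = file_digest_sha256(tmp_path)
os.remove(tmp_path)
print(file_id)
```

Hashing the file in chunks gives the same digest as hashing its whole content in one call.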
1.4.2.2. Generation of hashes using curl¶
The following snippet shows how a BLAKE2b hash of a web page can be generated by piping the output of curl into b2sum:
curl https://fairplus.github.io/cookbook-dev/intro | b2sum --length 256 --binary
24d470987fda1278c63c3f97ab30869b821906449f3ecf290ee48086b8215668
In our context, the hashing function is used to generate a unique key which may in turn be used to build a URL. This is simply one technical option for generating opaque URLs; it is not necessarily the most widespread approach.
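As a minimal sketch of that option, a digest can be appended to a namespace to obtain an opaque URL. The namespace `https://www.example.com/` and the function name are assumptions made for illustration; in practice the namespace should be one you control.

```python
import hashlib

# Hypothetical namespace used for illustration only.
NAMESPACE = "https://www.example.com/"


def opaque_url(content: bytes, namespace: str = NAMESPACE) -> str:
    """Build a URL whose last segment is the SHA-256 digest of the content."""
    return namespace + hashlib.sha256(content).hexdigest()


url = opaque_url(b"creating globally unique identifiers for FAIR data")
print(url)
```

Identical content always yields the same URL, while any change to the content yields a different one.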
1.5. Understanding Uniform Resource Locators (URLs)¶
Tip
URI construction is fundamentally about scoping the authority.
Having covered the technical details of generating globally unique identifiers, it is now necessary to discuss how to make identifiers resolvable (a notion also known as dereferenceable).
In other words, in order to create globally unique identifiers for the web, it is necessary to understand what Uniform Resource Locators (a.k.a. URLs) are and how to construct them for use with the Hypertext Transfer Protocol.
This results in URLs of the following form:
        userinfo       host      port
        ┌──┴───┐ ┌──────┴──────┐ ┌┴┐
https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
└─┬─┘   └───────────┬──────────────┘└───────┬───────┘ └───────────┬─────────────┘ └┬┘
scheme          authority                 path                  query          fragment
Source: https://en.wikipedia.org/wiki/Uniform_Resource_Identifier
The structure of a URL, according to the IETF specification (RFC 3986), is as follows:
URI = scheme:[//authority]path[?query][#fragment]
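This decomposition can be reproduced with the Python standard library. The sketch below applies urllib.parse.urlsplit to the example URL from the diagram above and prints each component.

```python
from urllib.parse import urlsplit

url = "https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # john.doe@www.example.com:123  (the authority)
print(parts.path)      # /forum/questions/
print(parts.query)     # tag=networking&order=newest
print(parts.fragment)  # top
```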
1.5.1. scheme¶
In this structure, the scheme defines the protocol or application to use to obtain the resource. The list of official schemes is maintained by the Internet Assigned Numbers Authority (IANA), and the most up-to-date version is available at https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml.
The most relevant URI schemes in the context of FAIR data and Linked Open Data are http and https, which denote the Hypertext Transfer Protocol and the Hypertext Transfer Protocol Secure.
1.5.2. authority¶
Besides the scheme, the other essential component of a URI is the authority, which, according to the Internet Engineering Task Force (IETF) specifications, has the following structure:
authority = [userinfo@]host[:port]
Note that the only required part is the host; userinfo and port are optional and should be avoided in identifiers for data.
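The three parts of the authority can be extracted with urllib.parse.urlsplit, which also makes it easy to see when the optional parts are absent. The helper name is illustrative; note that the `username` attribute returns only the part of userinfo before any `:password` suffix.

```python
from urllib.parse import urlsplit


def authority_parts(url: str) -> dict:
    """Split the authority of a URL into userinfo, host and port."""
    parts = urlsplit(url)
    return {"userinfo": parts.username, "host": parts.hostname, "port": parts.port}


full = authority_parts("https://john.doe@www.example.com:123/forum/questions/")
print(full)
# {'userinfo': 'john.doe', 'host': 'www.example.com', 'port': 123}

minimal = authority_parts("https://www.example.com/dataset-name/")
print(minimal)
# {'userinfo': None, 'host': 'www.example.com', 'port': None}
```

The second form, with the host only, is the one recommended above for data identifiers.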
1.5.3. host¶
In the authority, the host identifies the server hosting a resource, ultimately by its Internet Protocol (IP) address. Usually, the host is given as a host name such as www.example.org rather than as a raw IP address. The host name should be a Qualified Domain Name at minimum, or ideally a Fully Qualified Domain Name (FQDN), and be registered with the Domain Name System (DNS), which handles the resolution (lookup) between the host name and the IP address.
Tip
The authority is often reduced to the host, which is then loosely referred to as a ‘namespace’ or ‘domain name’.
The host is in fact further specified by 3 elements:
top-level domain: com in the www.example.com web address
second-level domain: example in the www.example.com web address
subdomain (host name): www in the www.example.com web address
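The three elements can be read off by splitting the host name on dots. This is a deliberately naive sketch: it assumes a single-label top-level domain and would mislabel hosts under multi-label public suffixes such as co.uk.

```python
host = "www.example.com"
labels = host.split(".")

top_level_domain = labels[-1]      # last label
second_level_domain = labels[-2]   # label just before the top-level domain
subdomain = ".".join(labels[:-2])  # everything in front of it

print(top_level_domain, second_level_domain, subdomain)  # com example www
```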
Tip
Subdomains are defined in the Domain Name System and belong to the main domain. Technically, to add a subdomain pointing to the domain name, one needs to add a CNAME record to the DNS for a registered domain name.
1.5.4. path¶
The path identifies the location of the resource on the host and consists of a sequence of zero or more path segments separated by a /.
1.5.5. query¶
The query is an optional part of the URL syntax that starts with a ?. Typically the query component consists of a series of key-value pairs separated by an & delimiter.
In the context of resolvable identifiers, query components should be avoided.
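The key-value structure of a query can be seen by parsing it with urllib.parse.parse_qsl. The small `has_query` helper is a hypothetical check that could be used to flag candidate identifiers that carry a query component.

```python
from urllib.parse import parse_qsl, urlsplit

url = "https://www.example.com/forum/questions/?tag=networking&order=newest"
pairs = parse_qsl(urlsplit(url).query)
print(pairs)  # [('tag', 'networking'), ('order', 'newest')]


def has_query(candidate: str) -> bool:
    """Return True if the URL carries a query component."""
    return urlsplit(candidate).query != ""


print(has_query(url))                                    # True
print(has_query("https://www.example.com/dataset/123"))  # False
```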
1.5.6. fragment¶
The fragment is an optional part of the URL syntax that starts with a #. It identifies a component within the returned resource and is used for client-side processing, e.g. to scroll to a particular section within a webpage.
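Because the fragment is handled on the client side, a client separates it from the rest of the URL before making a request. A minimal sketch with urllib.parse.urldefrag, using an illustrative URL:

```python
from urllib.parse import urldefrag

# urldefrag returns the URL without its fragment, and the fragment itself.
url_without_fragment, fragment = urldefrag("https://www.example.com/forum/questions/#top")

print(url_without_fragment)  # https://www.example.com/forum/questions/
print(fragment)              # top
```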
1.6. Generating Resolvable URLs¶
In the context of FAIR data, resources on the web must have unique, persistent, and resolvable identifiers.
In order to achieve persistence, resource identifiers must comply with the IETF standard RFC 3986 for URIs (and IRIs, which are URIs extended to cope with Unicode). This means that an identifier must comprise the following components:
a scheme: https
an authority: www.example.com
optionally a path: /dataset-name/
a local identifier (such as a database accession number, e.g. P12133 from UniProt) or a globally unique identifier (such as a UUID or hash code).
In a fictitious example which uses a UUID as the identifier and does not use a path, it looks like this:
https://www.example.com/5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0
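Assembling these components can be sketched with urllib.parse.urlunsplit. The function name `mint_url` and its defaults are illustrative, and www.example.com stands in for an authority you control.

```python
import uuid
from urllib.parse import urlunsplit


def mint_url(authority: str, local_id: str, path: str = "/") -> str:
    """Assemble scheme, authority, optional path and identifier into a URL."""
    if not path.endswith("/"):
        path += "/"
    # Query and fragment are left empty, as recommended for identifiers.
    return urlunsplit(("https", authority, path + local_id, "", ""))


print(mint_url("www.example.com", "5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0"))
# https://www.example.com/5b6d0be2-d47f-11e8-9f9d-ccaf789d94a0
print(mint_url("www.example.com", "P12133", path="/dataset-name/"))
# https://www.example.com/dataset-name/P12133

# Minting a fresh identifier on demand:
new_url = mint_url("www.example.com", str(uuid.uuid4()))
```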
Taking a real-life example, to make the UniProt accession number globally unique, one needs to provide the context in which the accession number is unique. This can be done by converting it into an Internationalized Resource Identifier (IRI, commonly referred to as a URL) by appending the local identifier to a namespace.
Tip
You should only use a namespace over which you have ownership (the authority), otherwise you cannot guarantee that the minted IRI will be globally unique; the organization or person who owns the namespace may already, or at some point in the future, use the IRI that you created for some other purpose.
In the case of UniProt, the resource has provided IRIs for each page about a protein as well as separate IRIs for the protein itself; this is because the page is not the concept of the protein but a document that describes properties of the protein. This separation of identities is achieved by using different namespaces for the different types of resource.
UniProt P38398 web page IRI: https://www.uniprot.org/uniprot/P38398
UniProt protein P38398 IRI: http://purl.uniprot.org/uniprot/P38398
Once such URIs are available, one may also turn them into compact identifiers called CURIEs. This will be discussed further in the next section.
1.7. Identifier Resolution - Enabling persistence through indirection¶
This relates to the following FAIR principle mentioned in the introduction:
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol.
Tip
URI resolution is fundamentally about directing requests to the relevant identified entity.
The standard approach is to resolve an HTTP GET request, using content negotiation to choose between different representations of the resource.
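With content negotiation, the client states which representation it would like through the HTTP Accept header. The sketch below only prepares such requests with the standard library and does not send them. The media types are illustrative, and whether a given server honours them is an assumption that has to be checked against that server.

```python
from urllib.request import Request


def negotiated_request(url: str, media_type: str) -> Request:
    """Prepare (but do not send) an HTTP GET asking for one representation."""
    return Request(url, headers={"Accept": media_type}, method="GET")


iri = "http://purl.uniprot.org/uniprot/P38398"
html_request = negotiated_request(iri, "text/html")
rdf_request = negotiated_request(iri, "application/rdf+xml")

print(rdf_request.get_method(), rdf_request.full_url)
print(rdf_request.get_header("Accept"))  # application/rdf+xml

# Sending the request would be done with urllib.request.urlopen(rdf_request).
```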
A PURL is a persistent URL, meaning that it provides a permanent address to access a resource on the web.
To understand the notion of PURL, one first needs to be familiar with the notion of URL indirection (also known as URL redirection or URL forwarding), which refers to the practice of providing a stable, fixed web address but setting it up so that it points to other content, whose location may change over time.
When a user retrieves a PURL, they are redirected to the current location of the resource.
When an author needs to move a page, they can update the PURL to point to the new location.
Tip
The practice of indirection comes in handy, as it ensures an invariant URL for resources which are known to change, owing to version changes or to a change in ownership for instance.
We can see this practice in action in the use of purl.obolibrary.org URLs for identifying OBO Foundry resources. For instance, the URL http://purl.obolibrary.org/obo/stato.owl is a redirect to the latest release of the file, which is https://raw.githubusercontent.com/ISA-tools/stato/dev/releases/latest_release/stato.owl.
PURLs with a common prefix are grouped together into domains. Each domain has a single maintainer who can add new PURLs to the domain and make changes to existing PURLs within the domain.
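The mechanics of indirection can be illustrated with a toy, in-memory model of a PURL domain. This is not how any actual PURL service is implemented; the class and method names are hypothetical, and the "new location" below is invented for the example.

```python
class PurlDomain:
    """Toy model of a PURL domain: stable address -> current location."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self._targets = {}

    def register(self, name: str, target: str) -> str:
        """Add a PURL to the domain and return its stable address."""
        self._targets[name] = target
        return self.prefix + name

    def update(self, name: str, new_target: str) -> None:
        """Point an existing PURL at a new location."""
        if name not in self._targets:
            raise KeyError(name)
        self._targets[name] = new_target

    def resolve(self, purl: str) -> str:
        """Return the current location behind a PURL (the redirect target)."""
        if not purl.startswith(self.prefix):
            raise ValueError("PURL is outside this domain")
        return self._targets[purl[len(self.prefix):]]


obo = PurlDomain("http://purl.obolibrary.org/obo/")
purl = obo.register(
    "stato.owl",
    "https://raw.githubusercontent.com/ISA-tools/stato/dev/releases/latest_release/stato.owl",
)
first_target = obo.resolve(purl)

# The file moves (hypothetical new location); the PURL itself does not change.
obo.update("stato.owl", "https://example.org/stato/stato.owl")
second_target = obo.resolve(purl)
print(purl, "->", second_target)
```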
FAIR Principle A1 states that:
(Meta)data should be retrievable by their identifier.
When the identifier is not a resolvable URL, Identifier Resolution Services that know how to map an IRI to a location for the data are required.
1.7.1. Introducing CURIES or Compact URIs¶
CURIEs (short for compact URIs) are defined by a World Wide Web Consortium Working Group Note, CURIE Syntax 1.0, and provide a human-readable shortening of IRIs.
A CURIE consists of a namespace prefix followed by the local identifier.
There are some widely used and defined CURIEs, such as DOIs and ISBNs. For example, the DOI [doi:10.1038/sdata.2016.18] refers to the FAIR Principles paper. The Digital Object Identifier System website (https://www.doi.org/) provides a resolution service for DOIs. The service is available as a web form on the site, or can be used by appending a DOI to the website address. The client will be redirected to the URL where the resource about the concept is located, e.g. for the FAIR Data Principles paper we can use the URL https://www.doi.org/10.1038/sdata.2016.18 to resolve the paper’s DOI. This results in the client being taken to the page at https://www.nature.com/articles/sdata201618.
Namespaces can be defined by convention, as is the case with doi, and registered with services to allow for the resolution of CURIEs (see Identifier Resolution Services below). These are extensively used to map CURIEs to URLs that can be resolved.
Going back to our Life Science context, we can use the following CURIE [uniprot:P38398] to refer to the UniProt record for the protein.
This is very useful for including unambiguous, global identifiers in scientific articles.
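Expanding a CURIE into an IRI, and compacting an IRI back into a CURIE, amounts to a lookup in a prefix map. The sketch below uses a small hand-written map built from the namespaces mentioned in this recipe; the function names and the map itself are illustrative, not a registry.

```python
# Illustrative prefix map; real maps are maintained by registries.
PREFIX_MAP = {
    "doi": "https://www.doi.org/",
    "uniprot": "http://purl.uniprot.org/uniprot/",
}


def expand(curie: str) -> str:
    """Turn prefix:local_id into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIX_MAP[prefix] + local_id


def compact(iri: str) -> str:
    """Turn a full IRI into prefix:local_id, if its namespace is known."""
    for prefix, namespace in PREFIX_MAP.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    raise ValueError(f"no known prefix for {iri}")


print(expand("uniprot:P38398"))             # http://purl.uniprot.org/uniprot/P38398
print(expand("doi:10.1038/sdata.2016.18"))  # https://www.doi.org/10.1038/sdata.2016.18
print(compact("http://purl.uniprot.org/uniprot/P38398"))  # uniprot:P38398
```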
1.7.2. Identifier Resolution services¶
- The PURL system is a service of the Internet Archive, which provides an interface to administer domains. For more information about the service, visit https://archive.org/services/purl/help
- w3id.org: Permanent Identifiers for the Web. Secure, permanent URLs for your Web application that will stand the test of time. It offers:
authority registration service
resolution service
redirection service: send a request to add a redirect to the public-perma-id@w3.org mailing list. Make sure to include the URL that you want on w3id.org, the URL that you want to redirect to, and the HTTP code that you want to use when redirecting. An administrator will then create the redirect for you.
- Identifiers.org is a Resolution Service, hosted by the EBI, which provides consistent access to life science data using Compact Uniform Resource Identifiers, both as a web form and through a URL pattern 7. Compact Identifiers consist of an assigned, unique prefix and a local, provider-designated accession number (prefix:accession). The resolving location of Compact Identifiers is determined using information that is stored in the Identifiers.org Registry. Datasets can register their namespace prefix together with their identifier pattern. The service can then be used in the same way as the DOI resolution service. So for the UniProt page about BRCA1, we can resolve the CURIE [uniprot:P38398] using Identifiers.org. This means that the URL https://identifiers.org/uniprot:P38398 resolves to the UniProt page https://www.uniprot.org/uniprot/P38398.
- Name-to-Thing (N2T) is a Resolution Service, maintained at the California Digital Library (CDL) within the University of California (UC) Office of the President. CDL supports electronic library services for ten UC campuses and affiliated law schools, medical centers, and national laboratories, as well as hundreds of museums, herbaria, botanical gardens, etc. Similar to URL shorteners like bit.ly, N2T serves content indirectly. N2T can store more than one “target” (forwarding link) for an identifier, as well as any kind or amount of metadata (descriptive information). N2T.net is also a “meta-resolver”. In collaboration with Identifiers.org, it recognizes over 600 well-known identifier types and knows where their respective servers are. Failing to find forwarding information for a specific individual identifier, it uses the identifier’s type to look for an overall target rule.
- Bioregistry is a Resolution Service, developed in a GitHub repository. Like Identifiers.org it has a registry, but it also has a registry of registries, and it imports data from Identifiers.org and Name-to-Thing. It extends beyond identifiers for things and also supports, for example, ontologies. As a community effort, new namespace prefixes and their identifier patterns can be registered via GitHub issues. Compact identifiers are supported, and the URL https://bioregistry.io/chebi:138488 resolves to the ChEBI page https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:138488. Bioregistry provides an API to query the registry itself.
For more details, see the Identifier Resolution Services recipe.
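The URL patterns quoted above for Identifiers.org and Bioregistry can be sketched as a small helper that turns a compact identifier into a resolver URL. The function name and the `RESOLVERS` table are illustrative; the code only builds the URL and does not contact either service.

```python
# Base URLs taken from the examples in this recipe.
RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "bioregistry": "https://bioregistry.io/",
}


def resolver_url(curie: str, resolver: str = "identifiers.org") -> str:
    """Build the resolver URL for a prefix:accession compact identifier."""
    prefix, separator, accession = curie.partition(":")
    if not (prefix and separator and accession):
        raise ValueError(f"not a prefix:accession identifier: {curie!r}")
    return RESOLVERS[resolver] + curie


print(resolver_url("uniprot:P38398"))
# https://identifiers.org/uniprot:P38398
print(resolver_url("chebi:138488", resolver="bioregistry"))
# https://bioregistry.io/chebi:138488
```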
1.8. Conclusion¶
In this recipe, we have given an overview of globally unique and persistent identifiers 8, 6, as called for by FAIR principle F1. We have covered:
The difference between global and local identifiers;
How to convert a local identifier into a global one;
Opaque and transparent identifiers.
We have also given an overview of the different services available for handling identifiers.
We cannot conclude this section on persistent identifiers without stressing how central they are to the production of Linked Data or Linked Open Data, which rely on three web standards: URI 5, 2, 11, RDF 4 and HTTP.
1.8.1. What to read next?¶
Learn more about:
FAIRsharing records appearing in this recipe:
- Bioregistry (bioregistry)
- Chemical Entities of Biological Interest (ChEBI)
- Compact URI (CURIE)
- Digital Object Identifier (DOI)
- GitHub
- OBO Foundry (OBO)
- Persistent Uniform Resource Locator (PURL)
- Resource Description Framework (RDF)
- The FAIR Principles (FAIR)
- UniProt Knowledgebase (UniProtKB)
- Uniform Resource Identifier (URI)
- Uniform Resource Locator (URL)
- w3id.org (w3id)
1.9. References¶
- 1. BLAKE2 specifications. URL: https://tools.ietf.org/html/rfc7693.
- 2. Cool URIs don’t change. URL: https://www.w3.org/Provider/Style/URI.
- 3. MD5 specifications. URL: https://tools.ietf.org/html/rfc1321.
- 4. RDF concepts. URL: https://www.w3.org/TR/rdf-concepts/.
- 5. URL. URL: https://tools.ietf.org/html/rfc1738.
- 6. Rachana Ananthakrishnan, Kyle Chard, Mike D’Arcy, Ian Foster, Carl Kesselman, Brendan McCollam, Jim Pruyne, Philippe Rocca-Serra, Robert Schuler, and Rick Wagner. An open ecosystem for pervasive use of persistent identifiers. In Practice and Experience in Advanced Research Computing, PEARC ‘20, 99–105. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3311790.3396660, doi:10.1145/3311790.3396660.
- 7. N. Juty, N. Le Novère, and C. Laibe. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Res, 40(Database issue):D580–586, Jan 2012.
- 8. J. A. McMurry, N. Juty, N. Blomberg, T. Burdett, T. Conlin, N. Conte, M. Courtot, J. Deck, M. Dumontier, D. K. Fellows, A. Gonzalez-Beltran, P. Gormanns, J. Grethe, J. Hastings, J. K. Hériché, H. Hermjakob, J. C. Ison, R. C. Jimenez, S. Jupp, J. Kunze, C. Laibe, N. Le Novère, J. Malone, M. J. Martin, J. R. McEntyre, C. Morris, J. Muilu, W. Müller, P. Rocca-Serra, S. A. Sansone, M. Sariyar, J. L. Snoep, S. Soiland-Reyes, N. J. Stanford, N. Swainston, N. Washington, A. R. Williams, S. M. Wimalaratne, L. M. Winfree, K. Wolstencroft, C. Goble, C. J. Mungall, M. A. Haendel, and H. Parkinson. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol, 15(6):e2001414, Jun 2017.
- 9. Adam Retter. Archival catalogue record identifiers. URL: https://blog.adamretter.org.uk/archival-catalog-identifiers/.
- 10. Adam Retter. Archival identifiers for digital files. URL: https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/.
- 11. Leo Sauermann and Richard Cyganiak. Cool URIs for the Semantic Web, W3C Semantic Web Education and Outreach Interest Group Note. 2008. URL: https://www.w3.org/TR/cooluris/.
1.10. Authors¶
Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution
---|---|---|---|---|---
 |  | Novartis AG |  |  | Conceptualization
 |  | Heriot Watt University |  |  | Writing - Original Draft
 |  | Maastricht University |  |  | Writing - Original Draft
 |  | Maastricht University |  |  | Writing - Original Draft
 |  | University of Oxford |  |  | Writing - Review & Editing, Conceptualization