1. Unique, persistent identifiers¶
The FAIR principles, under the
Findability and the
Accessibility chapters respectively, state that:
F1. (Meta)data are assigned a globally unique and persistent identifier
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
1.1. Main Objectives¶
The main goals of this recipe are therefore:
To understand the purpose of a globally unique and persistent identifier and how it can be used to retrieve the associated (meta)data using a standardized communication protocol. To explain how to generate globally unique identifiers, what IRIs are, and how they can be generated, retrieved and resolved.
From these principles, it is necessary to explain three key processes, which are:
1.1.1. Identifier minting¶
Identifier minting is fundamentally about the
authority deciding identity.
the tax office
the HR department
the company registry
1.1.2. URI construction¶
URI construction is fundamentally about
scoping the authority.
1.1.3. URI Resolution:¶
URI resolution is fundamentally about
directing requests to the relevant identified entity
The standard approach would be resolving an HTTP GET request using content negotiation to choose between different representations of the resource.
All these key points will be developed in this recipe.
1.3. Minting identifiers¶
Identifier minting is fundamentally about the authority deciding identity.
Identifiers are used to tag, identify, find and retrieve entities that are part of a collection or a resource maintained by some organization. This organization is the
authority which rules over that area of knowledge. The core assumption is that identifiers must be unique: they cannot be shared, and there is a one-to-one relation between the identifier and the entity it identifies.
Within an isolated system, disconnected from any other system, there is no risk of identifier collision. However, two isolated systems can create local identifiers that are completely identical yet denote completely different entities. In fact, this happens all the time.
These identifiers are therefore said to be locally unique: there is no guarantee that they are unique across all other existing systems, in other words, that they are globally unique.
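The difference between local and global uniqueness can be illustrated with a minimal Python sketch. The two record sets and both namespaces below are hypothetical and only serve to show the collision; they are not taken from any real system.

```python
# Two isolated systems, each minting simple local identifiers.
lab_a = {"1": "mouse tissue sample", "2": "buffer solution"}
lab_b = {"1": "patient consent form"}


def colliding_ids(first: dict, second: dict) -> list:
    """Return the local identifiers used by both systems."""
    return sorted(first.keys() & second.keys())


def qualify(namespace: str, records: dict) -> dict:
    """Prefix every local identifier with the namespace of its authority."""
    return {namespace + local_id: value for local_id, value in records.items()}


print(colliding_ids(lab_a, lab_b))  # both systems use the identifier "1"

# Hypothetical namespaces, one per authority.
merged = {
    **qualify("https://lab-a.example.org/", lab_a),
    **qualify("https://lab-b.example.org/", lab_b),
}
print(len(merged))  # all three records survive the merge
```

Once each identifier is scoped by its authority, the two records identified locally as "1" no longer overwrite each other when the collections are combined.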
1.4. How to produce globally unique identifiers?¶
There are 2 ways to produce non-resolvable, globally unique identifiers:
1.4.1. UUID based identifiers¶
With this approach, the notion of
universally unique is a probabilistic one. The probability of finding a duplicate within 103 trillion version-4 UUIDs is one in a billion. The likelihood of a collision (generation of the exact same identifier) is therefore extremely small but not zero, so with an ever-increasing number of digital resources to index, collisions cannot be ruled out entirely.
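The one-in-a-billion figure can be checked with the usual birthday-bound approximation. The sketch below assumes that a version-4 UUID carries 122 random bits (the remaining 6 of its 128 bits encode the version and variant); the function name is ours.

```python
import math

# Assumption: a version-4 UUID has 122 random bits out of 128.
RANDOM_BITS = 122


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation: p = 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = -n * (n - 1) / (2 * 2**bits)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)


n = 103 * 10**12  # 103 trillion identifiers
p = collision_probability(n)
print(f"{p:.2e}")  # on the order of 1e-9
```

This is an approximation, and it only holds if the random number generator behind the UUIDs is of good quality.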
According to the RFC4122 specifications, a UUID is an identifier that is unique across both space and time, with respect to the space of all UUIDs. Since a UUID is a fixed size and contains a time field, it is possible for values to rollover (around A.D. 3400, depending on the specific algorithm used). A UUID can be used for multiple purposes, from tagging objects with an extremely short lifetime, to reliably identifying very persistent objects across a network.
Key facts about UUIDs:
no centralized authority is required to administer them
content independent, entirely disconnected from the entity they can be associated with for identification purposes
generation on demand can be completely automated
completely semantic free (opaque) identifier
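1.4.1.1. Generation in Python¶
The following code snippet shows the generation of a UUID using the Python uuid package. Each run prints a different value, and in a version-4 UUID the third group always starts with the digit 4.
    import uuid

    # uuid4() builds the identifier from random bits
    identifier = uuid.uuid4()
    print(identifier)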
1.4.2. Hash based identifiers¶
This approach uses 2 inputs:
a cryptographic hashing algorithm implemented as a software function
a digital resource (e.g. a file)
The approach generates an identifier by using all or some of the content of the digital resource as input to the cryptographic hashing function, which computes an effectively unique string that acts as a signature (or fingerprint) of the digital resource [9, 10]. A number of algorithms are widely used, such as the Message Digest algorithm MD5 specified by RFC 1321 [3], the Secure Hash Algorithm 1 (SHA-1), Secure Hash Algorithm 2 (e.g. SHA-256), Secure Hash Algorithm 3 (SHA-3) and BLAKE2b-256 (RFC 7693) [1]. MD5 and SHA-1 are considered obsolete because practical collision attacks exist against them. SHA-2 and SHA-3 are standardized by NIST, while BLAKE2 is specified in RFC 7693 and is not a NIST standard.
Key fact about hash identifiers: It is not possible to reconstruct the original data from these hash strings. These are only fingerprints, which can therefore only be used to do the following tasks:
public key encryption
1.4.2.1. Generation in Python¶
The following code snippet shows the generation of a hash for a string using the Python hashlib package:
    import hashlib

    # encode the message to bytes using UTF-8 encoding
    message = "creating globally unique identifiers for FAIR data".encode("utf-8")

    # hash with MD5 (not recommended)
    print("MD5:", hashlib.md5(message).hexdigest())

    # hash with SHA-2 (SHA-256 and SHA-512)
    print("SHA-256:", hashlib.sha256(message).hexdigest())
    print("SHA-512:", hashlib.sha512(message).hexdigest())

    # hash with SHA-3
    print("SHA-3-256:", hashlib.sha3_256(message).hexdigest())
    print("SHA-3-512:", hashlib.sha3_512(message).hexdigest())

    # hash with BLAKE2 (256-bit BLAKE2s and 512-bit BLAKE2b)
    print("BLAKE2s:", hashlib.blake2s(message).hexdigest())
    print("BLAKE2b:", hashlib.blake2b(message).hexdigest())
1.4.2.2. Generation of hashes using curl¶
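The following snippet shows how a BLAKE2b hash of a web page can be generated by piping the output of curl into b2sum. The digest shown reflects the page content at the time it was computed.
    curl https://fairplus.github.io/cookbook-dev/intro | b2sum --length 256 --binary
    24d470987fda1278c63c3f97ab30869b821906449f3ecf290ee48086b8215668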
In our context, the hashing function is used to generate a unique key, which may in turn be used to build a URL. This is merely one technical option for generating opaque URLs, not necessarily the most widespread approach.
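As a minimal sketch of this option, the snippet below derives an opaque URL from the SHA-256 digest of some content. The namespace is a hypothetical placeholder, and both function names are ours; the recipe does not prescribe this particular layout.

```python
import hashlib

# Hypothetical namespace: replace it with a domain you control.
NAMESPACE = "https://www.example.com/sha256/"


def hash_identifier(content: bytes) -> str:
    """Return the SHA-256 digest of the content as a hexadecimal string."""
    return hashlib.sha256(content).hexdigest()


def hash_url(content: bytes, namespace: str = NAMESPACE) -> str:
    """Build an opaque URL by appending the digest to the namespace."""
    return namespace + hash_identifier(content)


data = b"creating globally unique identifiers for FAIR data"
url = hash_url(data)
print(url)
```

Because the key is computed from the content, the same bytes always yield the same URL, and any change to the content yields a different one.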
1.5. Understanding Uniform Resource Locators (URLs)¶
URI construction is fundamentally about scoping the authority.
Having covered the technical details of generating globally unique identifiers, it is now necessary to discuss how to make identifiers resolvable (a notion also known as dereferencing).
In other words, in order to create globally unique identifiers for the web, it is necessary to understand what Uniform Resource Locators (a.k.a. URLs) are and how to construct them for use with the Hypertext Transfer Protocol.
This results in URLs of the following form:
        userinfo       host      port
        ┌──┴───┐ ┌──────┴──────┐ ┌┴┐
https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top
└─┬─┘   └───────────┬──────────────┘└───────┬───────┘ └───────────┬─────────────┘ └┬┘
scheme          authority                 path                  query          fragment
The structure of a URL, according to the generic URI syntax defined in the IETF RFC 3986 specification, is as follows:
URI = scheme:[//authority]path[?query][#fragment]
In this structure, the scheme defines the protocol or application to use to obtain the resource. The list of official schemes is maintained by the Internet Assigned Numbers Authority (IANA), and the following link (https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml) holds the most up-to-date version.
The most relevant URI schemes in the context of FAIR data and Linked Open Data are http and https, which denote the Hypertext Transfer Protocol and the Hypertext Transfer Protocol Secure.
Within the authority, the host corresponds to the Internet Protocol (IP) address of a server hosting a resource. Often, the IP address corresponding to the host is given a host name. The host name should be a Qualified Domain Name at minimum, or ideally a Fully Qualified Domain Name (FQDN), and should be registered with the Domain Name System (DNS), which allows the resolution (lookup) between the IP address and the host name.
It is often the case that the authority is reduced to the host, which is then loosely referred to as a ‘namespace’ or ‘domain name’.
The host is in fact further specified by 3 elements, one of which is the subdomain. A subdomain can be defined in the DNS and belongs to the main domain. Technically, to add a subdomain pointing to the domain name, one needs to add a CNAME record to the DNS for a registered domain name.
The path defines the directory on the host where the resource is located and consists of a sequence of zero or more path segments separated by a slash (/).
The query is an optional part of the URL syntax that starts with a ?. Typically the query component consists of a series of key-value pairs separated by an ampersand (&).
In the context of resolvable identifiers, query components should be avoided.
The fragment is an optional part of the URL syntax that starts with a #. It identifies a component within the returned resource and is used for client side processing, e.g. to scroll to a particular section within a webpage.
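The components described above can be inspected with Python's standard urllib.parse module. The sketch below uses an illustrative URL that contains every component (userinfo, host, port, path, query and fragment); the URL itself is only an example and does not point to a real resource.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://john.doe@www.example.com:123/forum/questions/?tag=networking&order=newest#top"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # john.doe@www.example.com:123 (the authority)
print(parts.username)  # john.doe (userinfo)
print(parts.hostname)  # www.example.com (host)
print(parts.port)      # 123
print(parts.path)      # /forum/questions/
print(parts.query)     # tag=networking&order=newest
print(parts.fragment)  # top

# The query string decomposes into key-value pairs.
pairs = parse_qs(parts.query)
print(pairs)  # {'tag': ['networking'], 'order': ['newest']}
```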
1.6. Generating Resolvable URLs¶
In the context of FAIR data, resources on the web must have unique, persistent, and resolvable identifiers.
In order to achieve persistence, resource identifiers must comply with the RFC 3986 IETF standard for URIs (and IRIs, which are URIs extended to support Unicode). This means that an identifier must comprise the following components:
an authority: www.example.com
optionally a path:
a local identifier (such as a database accession number, e.g. P12133 from UniProt) or a globally unique identifier (such as a UUID or hash code).
In a virtual example which uses a UUID for the local identifier and does not use a path, it looks like this:
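A minimal Python sketch of such a construction is given here, assuming the placeholder authority www.example.com and a freshly generated version-4 UUID as the local identifier. The function name mint_iri and the "proteins" path used in the second call are hypothetical.

```python
import uuid
from urllib.parse import urlunsplit

# Placeholder authority: use a domain you own when minting real identifiers.
AUTHORITY = "www.example.com"


def mint_iri(local_id: str, path: str = "", scheme: str = "https") -> str:
    """Assemble scheme, authority, optional path and local identifier."""
    segments = [segment for segment in path.strip("/").split("/") if segment]
    full_path = "/" + "/".join(segments + [local_id])
    # No query and no fragment: both are left empty on purpose.
    return urlunsplit((scheme, AUTHORITY, full_path, "", ""))


# UUID as the local identifier, no path.
iri = mint_iri(str(uuid.uuid4()))
print(iri)

# Database accession number as the local identifier, with a path.
print(mint_iri("P12133", path="proteins"))
```

The first call prints a different IRI on every run, since the UUID is random. The second call always returns https://www.example.com/proteins/P12133.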
Taking a real-life example, to make the UniProt accession number globally unique, one needs to provide the context in which the accession number is unique. This can be done by converting it into an Internationalized Resource Identifier (IRI, commonly referred to as a URL) by appending the local identifier to a namespace.
You should only use a namespace over which you have ownership (the authority), otherwise you cannot guarantee that the minted IRI will be globally unique; the organization or person who owns the namespace may already use, or at some point in the future start to use, the IRI that you created for some other purpose.
In the case of UniProt, the resource has provided IRIs for each page about a protein as well as separate IRIs for the protein itself; this is because the page is not the concept of the protein but a document that describes properties of the protein. This separation of identities is achieved by using different namespaces for the different types of resource.
UniProt P38398 web page IRI: https://www.uniprot.org/uniprot/P38398
UniProt protein P38398 IRI: http://purl.uniprot.org/uniprot/P38398
Once such URIs are available, one may also turn them into compact identifiers called CURIEs. This will be discussed further in the next section.
1.7. Identifier Resolution - Enabling persistence through indirection¶
This relates to the following FAIR principle mentioned in the introduction:
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol.
URI resolution is fundamentally about directing requests to the relevant identified entity.
The standard approach would be resolving an HTTP GET request using content negotiation to choose between different representations of the resource.
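As a minimal sketch of content negotiation, the snippet below prepares (but does not send) an HTTP GET request whose Accept header states the preferred representation. The function name is ours, and the choice of application/rdf+xml is an assumption made for illustration: whether a given server honours a given media type depends on that server.

```python
from urllib.request import Request


def build_negotiated_request(iri: str, media_type: str) -> Request:
    """Prepare a GET request that asks for one representation of a resource."""
    return Request(iri, headers={"Accept": media_type}, method="GET")


request = build_negotiated_request(
    "http://purl.uniprot.org/uniprot/P38398",
    "application/rdf+xml",  # assumed media type, for illustration only
)

print(request.get_method())          # GET
print(request.full_url)              # the identifier being resolved
print(request.get_header("Accept"))  # the representation being asked for
```

Passing the prepared request to urllib.request.urlopen would send it over the network; that step is left out so that the sketch runs offline.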
A PURL is a
persistent URL, meaning that it provides a permanent address to access a resource on the web.
To understand the notion of PURL, one first needs to become familiar with the notion of
URL indirection (also known as
URL redirection or
URL forwarding), which refers to the practice of providing a stable, fixed web address that is set up to point to other content, whose location may change over time.
When a user retrieves a PURL, they are
redirected to the current location of the resource.
When an author needs to move a page, they can update the PURL to point to the new location.
Indirection comes in handy because it keeps the address invariant for resources that are known to change, owing for instance to version changes or to a change in ownership.
We can see this practice in action in the use of PURLs for identifying OBO Foundry resources. For instance, the URL
http://purl.obolibrary.org/obo/stato.owl redirects to the latest release of the file, which is https://raw.githubusercontent.com/ISA-tools/stato/dev/releases/latest_release/stato.owl.
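The principle of indirection can be sketched with a small in-memory lookup table. This is only an illustration of the idea, not how any PURL service is implemented: the class name is ours, and the second target location is hypothetical.

```python
class PurlResolver:
    """In-memory stand-in for a PURL service: stable address -> current location."""

    def __init__(self):
        self._targets = {}

    def set_target(self, purl, target):
        """Register a PURL, or repoint an existing one to a new location."""
        self._targets[purl] = target

    def resolve(self, purl):
        """Return the current location, as an HTTP redirect would."""
        try:
            return self._targets[purl]
        except KeyError:
            raise LookupError(f"unknown PURL: {purl}") from None


PURL = "http://purl.obolibrary.org/obo/stato.owl"
resolver = PurlResolver()
resolver.set_target(
    PURL,
    "https://raw.githubusercontent.com/ISA-tools/stato/dev/releases/latest_release/stato.owl",
)
first = resolver.resolve(PURL)

# The file moves (hypothetical new location): only the mapping is updated.
resolver.set_target(PURL, "https://example.org/stato/releases/latest/stato.owl")
second = resolver.resolve(PURL)

print(first)
print(second)
```

The address that users cite stays the same before and after the move; only the entry in the table changes.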
PURLs with a
common prefix are grouped together into domains. Each domain has a single maintainer who can add new PURLs to the domain and make changes to existing PURLs within the domain.
FAIR Principle A1 states that:
(meta)data should be retrievable by its identifier.
When the identifier is not a resolvable URL, then
Identifier Resolution Services are required that know how to map an IRI to a location for the data.
1.7.1. Introducing CURIES or Compact URIs¶
CURIEs (short for compact URIs) are defined by a World Wide Web Consortium Working Group Note, CURIE Syntax 1.0, and provide a human-readable shortening of IRIs.
A CURIE consists of a
namespace prefix followed by a colon and the local identifier.
Some CURIEs are widely used and well defined, such as DOIs and ISBNs. For example the DOI
[doi:10.1038/sdata.2016.18] refers to the FAIR Principles paper. The Digital Object Identifier System web site (https://www.doi.org/) provides a resolution service for DOIs. The service is available as a web form on the site, or can be used by appending a DOI to the website address. The client is then redirected to the URL where the resource about the concept is located, e.g. for the FAIR Data Principles paper we can use the URL https://www.doi.org/10.1038/sdata.2016.18 to resolve the paper’s DOI. This results in the client being taken to the page at https://www.nature.com/articles/sdata201618.
Namespaces can be defined by convention, such as the case with
doi, and registered with services to allow for the resolution of CURIEs (see Identifier Resolution Services below). These are extensively used to map CURIEs to URLs that can be resolved.
Going back to our Life Science context, we can use the following CURIE
[uniprot:P38398] to refer to the UniProt record for the protein.
This is very useful for including unambiguous, global identifiers in scientific articles.
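Expanding a CURIE into a resolvable IRI amounts to looking up the prefix and appending the local identifier. The sketch below assumes a small hand-written prefix map built from the UniProt and DOI addresses mentioned in this recipe; real resolution services maintain much larger registries, and the function name is ours.

```python
# Assumed prefix map, limited to the two namespaces used in this recipe.
PREFIX_MAP = {
    "uniprot": "http://purl.uniprot.org/uniprot/",
    "doi": "https://www.doi.org/",
}


def expand_curie(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Turn 'prefix:local_id' (optionally in square brackets) into an IRI."""
    text = curie.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    # Split on the first colon only: the local identifier may contain colons.
    prefix, separator, local_id = text.partition(":")
    if not separator or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unregistered prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


print(expand_curie("[uniprot:P38398]"))
print(expand_curie("[doi:10.1038/sdata.2016.18]"))
```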
1.7.2. Identifier Resolution services¶
The PURL system is a service of the Internet Archive, which provides an interface to administer domains. For more information about the service, visit https://archive.org/services/purl/help
w3id.org provides Permanent Identifiers for the Web: secure, permanent URLs for your Web application that will stand the test of time.
authority registration service
Send a request to add a redirect to the email@example.com mailing list. Make sure to include the URL that you want on w3id.org, the URL that you want to redirect to, and the HTTP code that you want to use when redirecting. An administrator will then create the redirect for you.
Identifiers.org is a Resolution Service, hosted by the EBI, that provides consistent access to life science data using
Compact Uniform Resource Identifiers. It offers resolution both as a web form and through a URL pattern [7].
Compact Identifiers consist of an assigned unique prefix and a local provider designated accession number (prefix:accession). The resolving location of Compact Identifiers is determined using information that is stored in the Identifiers.org Registry. Datasets can register their namespace prefix together with their identifier pattern. The service can then be used in the same way as the DOI resolution service. So for the UniProt page about BRCA1, we can resolve the CURIE [uniprot:P38398] using Identifiers.org. This means that the URL https://identifiers.org/uniprot:P38398 resolves to the UniProt page https://www.uniprot.org/uniprot/P38398.
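The role of the registered prefix and identifier pattern can be sketched as follows. The registry below is a toy stand-in with a deliberately simplified pattern for UniProt accessions; the authoritative patterns are those stored in the Identifiers.org Registry, and the function name is ours.

```python
import re

RESOLVER = "https://identifiers.org/"

# Toy registry: prefix -> simplified, assumed identifier pattern.
REGISTRY = {
    "uniprot": re.compile(r"[A-Z][A-Z0-9]{5,9}"),
}


def resolution_url(compact_id: str) -> str:
    """Validate a compact identifier and build its resolver URL."""
    prefix, separator, accession = compact_id.partition(":")
    pattern = REGISTRY.get(prefix)
    if not separator or pattern is None:
        raise ValueError(f"unregistered prefix in {compact_id!r}")
    if not pattern.fullmatch(accession):
        raise ValueError(f"accession {accession!r} does not match the pattern")
    return RESOLVER + compact_id


print(resolution_url("uniprot:P38398"))
```

Checking the accession against the registered pattern lets a resolver reject malformed identifiers before attempting any redirect.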
Name-to-Thing (N2T) is a Resolution Service maintained at the California Digital Library (CDL) within the University of California (UC) Office of the President. CDL supports electronic library services for ten UC campuses and affiliated law schools, medical centers, and national laboratories, as well as hundreds of museums, herbaria, botanical gardens, etc. Similar to URL shorteners like bit.ly, N2T serves content indirectly. N2T can store more than one “target” (forwarding link) for an identifier, as well as any kind or amount of metadata (descriptive information). N2T.net is also a “meta-resolver”. In collaboration with identifiers.org, it recognizes over 600 well-known identifier types and knows where their respective servers are. Failing to find forwarding information for a specific individual identifier, it uses the identifier’s type to look for an overall target rule.
Bioregistry is a Resolution Service developed in a GitHub repository. Like Identifiers.org it has a registry, but it also has a registry of registries. It imports data from Identifiers.org and Name-to-Thing, and extends beyond identifiers for things to also support, for example, ontologies. As a community effort, new namespace prefixes and their identifier patterns can be registered via GitHub issues. Compact identifiers are supported, and the URL https://bioregistry.io/chebi:138488 resolves to the ChEBI page https://www.ebi.ac.uk/chebi/searchId.do?chebiId=CHEBI:138488. Bioregistry provides an API to query the registry itself.
For more details, see the Identifier Resolution Services recipe.
The difference between global and local identifiers;
How to convert a local identifier into a global one;
Opaque and transparent identifiers
We have given an overview of the different services available for handling identifiers.
But we cannot conclude this section on persistent identifiers without stressing how central they are to the production of Linked Data or Linked Open Data, which rely on 3 web standards: URI [5, 2, 11], RDF [4] and HTTP.
1.8.1. What should I read next?¶
[1] BLAKE2 specifications. URL: https://tools.ietf.org/html/rfc7693.
[2] Cool URIs don’t change. URL: https://www.w3.org/Provider/Style/URI.
[3] MD5 specifications. URL: https://tools.ietf.org/html/rfc1321.
[4] RDF concepts. URL: https://www.w3.org/TR/rdf-concepts/.
[5] URL. URL: https://tools.ietf.org/html/rfc1738.
[6] Rachana Ananthakrishnan, Kyle Chard, Mike D’Arcy, Ian Foster, Carl Kesselman, Brendan McCollam, Jim Pruyne, Philippe Rocca-Serra, Robert Schuler, and Rick Wagner. An open ecosystem for pervasive use of persistent identifiers. In Practice and Experience in Advanced Research Computing, PEARC ‘20, 99–105. New York, NY, USA, 2020. Association for Computing Machinery. URL: https://doi.org/10.1145/3311790.3396660, doi:10.1145/3311790.3396660.
[7] N. Juty, N. Le Novère, and C. Laibe. Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Res, 40(Database issue):D580–586, Jan 2012.
[8] J. A. McMurry, N. Juty, N. Blomberg, T. Burdett, T. Conlin, N. Conte, M. Courtot, J. Deck, M. Dumontier, D. K. Fellows, A. Gonzalez-Beltran, P. Gormanns, J. Grethe, J. Hastings, J. K. Hériché, H. Hermjakob, J. C. Ison, R. C. Jimenez, S. Jupp, J. Kunze, C. Laibe, N. Le Novère, J. Malone, M. J. Martin, J. R. McEntyre, C. Morris, J. Muilu, W. Müller, P. Rocca-Serra, S. A. Sansone, M. Sariyar, J. L. Snoep, S. Soiland-Reyes, N. J. Stanford, N. Swainston, N. Washington, A. R. Williams, S. M. Wimalaratne, L. M. Winfree, K. Wolstencroft, C. Goble, C. J. Mungall, M. A. Haendel, and H. Parkinson. Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol, 15(6):e2001414, Jun 2017.
[9] Adam Retter. Archival catalogue record identifiers. URL: https://blog.adamretter.org.uk/archival-catalog-identifiers/.
[10] Adam Retter. Archival identifiers for digital files. URL: https://blog.adamretter.org.uk/archival-identifiers-for-digital-files/.
[11] Leo Sauermann and Richard Cyganiak. Cool URIs for the Semantic Web, W3C Semantic Web Education and Outreach Interest Group Note. 2008. URL: https://www.w3.org/TR/cooluris/.