6.1. Portals and lookup services¶
6.1.1. Main Objective¶
This recipe provides guidance on making a decision about the feasibility of a local deployment of existing open source ontology service software.
By the expression “ontology lookup service”, we refer to any type of application, standalone or Web-based, that enables the use of existing ontologies to support knowledge formalization and sharing, by fostering ontology-based descriptions of knowledge.
Therefore, tools useful to build, edit or maintain ontologies are not considered as ontology lookup services and thus are out of the scope of this document.
The recipe will:
define the most common selection criteria to be considered
provide general selection recommendations
provide recommendations for applying those selection criteria
give an overview of the most common open source ontology service software
6.1.2. Software selection criteria¶
This section presents the minimal criteria to take into account when analyzing alternatives for ontology-based service development and deployment. Additional criteria, including a more detailed analysis of technical features, can be found in the resources mentioned in the Additional resources section.
6.1.2.1. Functionality¶
The functionality of a software product determines the range of capabilities and functions it can perform.
Please note that specific functional selection criteria are beyond the scope of this recipe.
Because functionality plays a very important role in the overall selection process, it is included here to show how it relates to the technical and architectural selection process.
Functional selection criteria are covered by Recipe FCB004.
6.1.2.2. Interfaces¶
Interfaces allow data to be read from or written to the ontology lookup service from the outside, either by a human being or by an application.
For an ontology lookup service the most important interface features are:
Supported ontology import and export formats, e.g. OWL for uploading and downloading ontologies.
A flexible query interface, e.g. to answer very specific ontology questions or to fill functional gaps of the ontology service. Currently, the most prominent query interface is the SPARQL endpoint.
Application Programming Interface (API) technology: if you want to integrate other applications with the ontology lookup service, it is essential that you can use widely adopted and supported technical standards. Currently, the most prominent API technology is the REST API. A sketch of both interface styles follows this list.
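As a minimal illustration of these two interface styles, the following Python sketch queries a hypothetical lookup service via its REST API and its SPARQL endpoint. Both URLs are placeholders, not the API of any specific product.

```python
import requests

# Placeholder endpoints -- substitute those of your ontology lookup service.
REST_API = "https://example.org/api/ontologies"
SPARQL_ENDPOINT = "https://example.org/sparql"

# REST API call: list the available ontologies (JSON is the typical payload).
resp = requests.get(REST_API, timeout=30)
resp.raise_for_status()
print(resp.json())

# SPARQL query: ask a very specific question that canned REST calls may not cover.
query = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label WHERE { ?concept skos:prefLabel ?label } LIMIT 10
"""
resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
resp.raise_for_status()
for binding in resp.json()["results"]["bindings"]:
    print(binding["concept"]["value"], binding["label"]["value"])
```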
Please note that this recipe does not focus on specific interface functionality. It looks at interfaces only from an architectural and technical view.
6.1.3. Software Architecture¶
The software architecture describes the hardware and software components used and their relationships.
Regarding ontology lookup service selection the most important architectural aspects are:
Overall architecture complexity. This gives you an idea of whether the complexity is appropriate for your requirements. If you are trying to solve simple requirements with a very complex solution, you might be on the wrong track.
Used tools and programming languages. This gives you an idea of the knowledge you will need to support the system or extend its functionality. You also get an overview of the impact on the overall set of IT tools and programming languages used in your organization.
Modularity. This gives you an idea of whether you could replace some of the components with the software/hardware preferred as standard in your company. It can also give you a hint as to whether you can scale the application by adding more hardware/software resources.
6.1.4. Deployment model¶
The deployment model shows where and how the software can be installed and who owns the service.
Regarding ontology lookup service selection the most important deployment aspects are:
On premise versus cloud deployment. Depending on your organisation's policies and best practices, you may want to install and maintain the software on your own infrastructure (on premise), or you may prefer to buy it as a service in the cloud.
Manual versus containerized versus virtual image installation.
With a manual installation, you have full control over the installation, but you typically need more time.
A virtual image installation bundles the software together with the operating system, so it is easier to install, but you will typically need additional infrastructure and knowledge in your organisation to maintain all virtual images.
A Docker-based installation is also easy to install and typically saves more hardware resources than a virtual image installation, because the operating system is shared among multiple Docker applications. As with a virtual image installation, you will need additional infrastructure and knowledge in your organisation to run and maintain all Docker images.
6.1.5. Hardware and software requirements¶
The hardware requirements mainly have an impact on costs. The software requirements have an impact on knowledge and costs (e.g. licences for operating systems).
The specific requirements of your organisation for data processing and storage will also influence the costs.
6.1.5.1. License model¶
The license model defines the consumer rights and the usage costs.
So it is essential that the licence model:
matches your intended use
produces costs that are acceptable to your organisation from a price/performance point of view.
6.1.5.2. Database Technology for storing knowledge representation resources¶
The terminology database is a central component of the knowledge management stack, as it stores the ontologies.
The database system will typically also have a major impact on performance and scalability, because the bulk of ontology query processing will take place within the database system.
An ontology lookup service is defined to be database agnostic if its database component:
provides interfaces that use standard communication protocols.
provides configurable access to the database.
allows any database product supporting a specific standard (e.g. SQL, SPARQL) to be used.
A database-agnostic ontology lookup service will therefore give you maximum freedom to use the database type defined as standard in your organisation, as sketched below.
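To illustrate what database agnosticism buys you in practice, the following sketch keeps the backend behind a configuration object, so that any product speaking the standard SPARQL protocol can be swapped in. All names and URLs here are invented for the example.

```python
from dataclasses import dataclass
import requests

@dataclass
class BackendConfig:
    endpoint_url: str          # e.g. a Virtuoso, GraphDB, or Fuseki SPARQL endpoint
    protocol: str = "sparql"   # standard protocol keyword

def run_query(cfg: BackendConfig, query: str) -> dict:
    """Run a query against whichever standards-compliant backend is configured."""
    if cfg.protocol != "sparql":
        raise ValueError(f"Unsupported protocol: {cfg.protocol}")
    resp = requests.get(
        cfg.endpoint_url,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

# Swapping database products is then a configuration change only:
cfg = BackendConfig(endpoint_url="https://example.org/sparql")
print(run_query(cfg, "SELECT * WHERE { ?s ?p ?o } LIMIT 5"))
```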
6.1.5.2.1. Relational databases¶
Relational Database Management Systems (RDBMS), which represent data in tabular format, are often used for storing metadata representable in flat taxonomies.
6.1.5.2.2. Graph databases¶
From an ontology perspective, the state of the art is to use a graph database. Two types of graph databases are currently available; a toy contrast of the two models follows this list:
Labeled-property graph. A labeled-property graph model is represented by a set of nodes, relationships, properties, and labels.
Triple store. A triple store database can natively store documents in RDF or OWL/RDF format and offers the remote-query flexibility of a SPARQL endpoint. In addition, the W3C Shapes Constraint Language (SHACL) standard can help to add quality checks.
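To make the contrast concrete, here is a toy illustration in Python of how the same fact about a concept might be shaped in each model; all identifiers are invented for the example.

```python
# Labeled-property graph model: nodes and relationships carry labels and
# arbitrary key/value properties (the shape used by Neo4J-style databases).
lpg_nodes = [
    {"id": 1, "labels": ["Concept"], "properties": {"prefLabel": "aspirin"}},
    {"id": 2, "labels": ["Concept"], "properties": {"prefLabel": "drug"}},
]
lpg_relationships = [
    {"start": 1, "end": 2, "type": "SUBCLASS_OF", "properties": {}},
]

# Triple store model: everything is a (subject, predicate, object) statement.
# This is exactly the RDF shape that SPARQL queries and SHACL validates.
triples = [
    ("ex:aspirin", "rdf:type", "skos:Concept"),
    ("ex:aspirin", "skos:prefLabel", '"aspirin"@en'),
    ("ex:aspirin", "rdfs:subClassOf", "ex:drug"),
]
```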
6.1.5.3. Ontology language¶
The following ontology languages are widely used in the pharma research arena to model ontologies:
Simple Knowledge Organization System (SKOS). SKOS is a W3C standard which provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF). Encoding this information in RDF allows it to be passed between computer applications in an interoperable way [2]; a minimal example follows this list.
Web Ontology Language (OWL). OWL is defined by the W3C and has become the de facto standard for ontology modelling. Therefore, OWL support is considered a must for an ontology lookup service.
OBO. The OBO file format is a biology-oriented language for building ontologies, based on the principles of OWL. A standard common mapping has been created for lossless round-trip transformations between the two languages.
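As a minimal example of the SKOS point above, the following Python snippet uses the rdflib library to encode one SKOS concept in RDF and serialize it for exchange; the namespace and labels are invented for the example.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

EX = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("skos", SKOS)

# A single SKOS concept with a preferred and an alternative label.
g.add((EX.Aspirin, RDF.type, SKOS.Concept))
g.add((EX.Aspirin, SKOS.prefLabel, Literal("acetylsalicylic acid", lang="en")))
g.add((EX.Aspirin, SKOS.altLabel, Literal("aspirin", lang="en")))

# Because SKOS is encoded in RDF, the graph can be exchanged in any RDF
# serialization understood by other applications.
print(g.serialize(format="turtle"))
```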
Persisting semantic artefacts expressed in various languages and representation frameworks in the same management system is not straightforward, and conversions may be necessary. Even then, transformations may lead to information loss or difficulty in rendering information consistently.
6.1.5.4. Programming language¶
Programming languages are used to implement the data processing logic and user interface logic of the ontology lookup service.
The programming languages used will impact:
The programming language knowledge required for customization or support
Customization effort
6.1.5.5. Support¶
Important support aspects for a vocabulary service/ontology lookup service are:
Ongoing development of the tool
Frequency of issues and how fast they are solved
Which organization you can get support from, and the associated cost
6.1.6. General selection considerations¶
Before looking into a concrete ontology service, some general considerations are recommended. Two types of portal tools are available:
Open data portal tool. Open data portals provide web-based interfaces designed to make it easier to find and access re-usable information. Some of them also support importing and exporting ontologies, include a SPARQL endpoint, and provide core ontology lookup service functionality. An Open Data Portal Tool is the underlying software used to implement the open data portal functionalities.
Ontology portal tool. A formal definition of an Ontology Portal does not exist. In the context of this document, an Ontology Portal is defined as an Open Data Portal that is specialized in ontologies as data and typically provides more fine-grained ontology-based functions out of the box. An Ontology Portal Tool is the underlying software used to implement the ontology portal functionalities.
If you have only minimal functional requirements for sharing ontologies, using an open data portal tool may also be an option. In this case you could extend the functionality by developing additional web pages on top of the SPARQL endpoint, as sketched below. With data and metadata in one database, such a solution would allow adding functionality that combines ontologies with data (e.g. by annotation).
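As a minimal sketch of this extension approach, the following hypothetical Flask page runs a label search against the portal's SPARQL endpoint and renders the result as HTML; the endpoint URL and query are illustrative placeholders.

```python
import requests
from flask import Flask, request

app = Flask(__name__)
SPARQL_ENDPOINT = "https://example.org/sparql"  # the portal's endpoint

@app.route("/lookup")
def lookup():
    term = request.args.get("term", "")
    # NOTE: naive string interpolation is fine for a sketch, but untrusted
    # input should be escaped in production to avoid SPARQL injection.
    query = f"""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?concept ?label WHERE {{
        ?concept skos:prefLabel ?label .
        FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{term}")))
    }} LIMIT 20
    """
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["results"]["bindings"]
    items = "".join(f"<li>{r['label']['value']}</li>" for r in rows)
    return f"<ul>{items}</ul>"
```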
If you need fine-grained ontology lookup service functionality, an ontology portal tool is recommended.
An additional option would be to run an open data portal tool and an ontology portal tool in parallel. If both tools use a triple store database, this should be possible in principle; the challenge is that additional customisation would be needed.
6.1.7. Choosing an ontology service software¶
As each organization may have its own preferences and requirements, there is no standard way to select the most suitable ontology service software. This section presents a general selection process based on the aforementioned selection criteria and gives guidance on a set of questions that should be answered in order to filter out, at an early stage, tools that do not fit the use case.
6.1.8. Overall Selection Process¶
A three-step selection approach is proposed:
High Level Gap Analysis
First, it should be checked on a high level whether the tool matches the high level requirements.
Low Level Gap Analysis
Only if the tool matches on a high level should more effort be invested in a finer analysis to find out whether the tool is still a suitable candidate.
Candidates Selection
Once the tool candidates have been identified, a ranking process can start by assigning fulfilment numbers to the criteria, weighted to reflect their importance for the requesting organization. Finally, summing up the numbers from each atomic ranking criterion completes the ranking and allows the tool with the highest score to be chosen; a toy version of this computation is sketched below.
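The ranking step reduces to simple arithmetic, as in the following sketch; the criteria, weights, and fulfilment numbers are invented for illustration.

```python
# Weights reflect the importance of each criterion for the requesting
# organization; fulfilment numbers score how well a candidate tool meets it.
weights = {"interfaces": 5, "architecture": 3, "deployment": 2, "license": 4}

candidates = {
    "Tool A": {"interfaces": 3, "architecture": 2, "deployment": 3, "license": 3},
    "Tool B": {"interfaces": 2, "architecture": 3, "deployment": 2, "license": 1},
}

def total_score(fulfilment: dict) -> int:
    # Sum of weight * fulfilment over all atomic ranking criteria.
    return sum(weights[c] * fulfilment[c] for c in weights)

# Rank candidates by total score; the highest scorer is chosen.
ranking = sorted(candidates, key=lambda t: total_score(candidates[t]), reverse=True)
for tool in ranking:
    print(tool, total_score(candidates[tool]))
```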
The following figure shows the overall process:
6.1.8.1. High Level Gap Analysis¶
As guidance for the High Level Gap Analysis, an analysis order based on the selection criteria is proposed. Each of the most important selection criteria contains one major question that has to be answered positively, either by the tool's offerings or by some additional tool customization.
6.1.8.2. Low Level Gap Analysis¶
For a single low level selection criterion, no common recommendation for the "tool does not fit" decision can be given, because the decision depends strongly on the preferences set in your specific context.
Instead, a set of questions is presented per selection criterion. You then have to pick out those questions that are absolutely mandatory in your local context.
If such an absolutely mandatory question cannot be satisfied by the tool or by tool customization, the "tool does not fit" decision fires.
Warning
Please note that no questions are presented for ontology functionality, because functionality is out of the scope of this recipe.
The following figures show typical questions one would have to answer for the low level analysis.
These questions may have to be adapted or extended depending on local, specific needs.
Functional questions
User interface questions
Architecture questions
Costs questions
Performance questions
Delivery questions
Support questions
6.1.9. Available open source software¶
6.1.9.1. EMBL-EBI Ontology Lookup Service¶
6.1.9.1.1. Overview¶
The EMBL-EBI Ontology Lookup Service (OLS) is a repository for biomedical ontologies that aims to provide a single point of access to the latest ontology versions. Ontologies can be browsed through the website as well as programmatically via the OLS API. It is part of the ELIXIR interoperability services.
6.1.9.1.2. Details¶
Functionality:
Ontology Portal Tool
Interface: REST-style API supported (see the example at the end of this section); SPARQL endpoint under development.
Architecture: OLS has been developed with the Spring Data and Spring Boot framework.
Tomcat is used as a web server.
MongoDB is used for storing configuration YAML files.
The Neo4J node-property graph database is used for storing and accessing the ontologies; the OWL format is converted to a node-property representation.
Deployment model: It is available both as an on-premises and cloud-based solution. Docker based deployment is supported.
Requirements:
Hardware requirements. It requires a standard workstation with 1 GB of main memory and about 100 MB of hard disk space.
Software requirements. It is implemented as a Java web application to be deployed to the Tomcat 7.5 Java application container. It requires Java 8, Maven 3+ as dependency manager and build environment, MongoDB 2.7.8+ as database, and Solr 5.2.1+ as indexing and search engine.
License model. Apache Software Licence (v. 2.0).
Databases: It supports the Neo4J graph store, which can be queried using the Cypher query language. Reasoning supports two profiles, OWL 2 and EL, with EL as the default. The supported reasoners are HermiT and ELK.
Ontology Language: Custom translation of the OBO and OWL 2 languages to the Neo4J graph model.
Programming Language: Java.
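As an illustration of programmatic access, the following sketch searches the public OLS REST API for a term. The endpoint path and response fields follow the OLS documentation at the time of writing and should be verified against the current API docs.

```python
import requests

# Free-text search across the ontologies hosted by OLS.
resp = requests.get(
    "https://www.ebi.ac.uk/ols4/api/search",
    params={"q": "diabetes", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

# The search response is Solr-shaped: matching terms sit under response/docs.
for doc in resp.json()["response"]["docs"]:
    print(doc.get("ontology_name"), doc.get("iri"), doc.get("label"))
```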
6.1.9.2. NCBO Bioportal Virtual Appliance (Ontology Portal Tool)¶
6.1.9.2.1. Overview¶
The National Center for Biomedical Ontology (NCBO) provides BioPortal, a repository of biomedical ontologies, and distributes its software as the BioPortal Virtual Appliance for local deployment.
6.1.9.2.2. Details¶
Functionality:
Ontology Portal Tool
Interface: REST-style API and SPARQL endpoint supported (see the example at the end of this section).
Architecture: The Virtual Appliance defines the framework for the web service. Internally, the system uses the following components:
A set of additional Ruby-based modules that implement the user interface and additional functionality.
4Store triple store database is used to store and access ontologies.
Solr is used to create indexes out of description text metadata.
MySQL is used to store additional metadata.
MGrep is used for annotating text with ontology terms.
Deployment model: It is available both as an on-premises and cloud-based solution. It is delivered as a VMware Virtual Appliance or as an Amazon AWS AMI.
Requirements:
Hardware requirements.
Minimum: 2 CPUs (2 GHz), 4 GB RAM, 20 GB hard disk space.
Recommended for heavier usage: 3 CPUs (3 GHz), 8 GB RAM (or more, depending on the size/number of ontologies), 20 GB hard disk space (or more, depending on the number/size of ontologies).
Software requirements. All software is already contained in the virtual image.
Operating system: CentOS (Linux)
License model. Apache Software Licence (v. 2.0).
Databases: It supports the 4Store triple store and MySQL
Ontology Language: OBO, OWL
Programming Language: Ruby, Java.
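A lookup against the BioPortal REST API looks like the following sketch: the public instance requires a free API key, while a local Virtual Appliance exposes the same API on your own host. Paths and response fields follow the NCBO documentation at the time of writing and should be verified.

```python
import requests

API_KEY = "your-api-key"               # placeholder -- obtain one from BioPortal
BASE = "https://data.bioontology.org"  # or http://<your-appliance-host>

# Free-text search across the hosted ontologies.
resp = requests.get(
    f"{BASE}/search",
    params={"q": "melanoma"},
    headers={"Authorization": f"apikey token={API_KEY}"},
    timeout=30,
)
resp.raise_for_status()

# Matching classes are returned under the "collection" key.
for result in resp.json()["collection"]:
    print(result.get("prefLabel"), result.get("@id"))
```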
6.1.9.3. Apache Marmotta (Open Data Platform Tool)¶
6.1.9.3.1. Overview¶
Apache Marmotta is an Open Data Platform for Linked Data, which provides an open implementation of a Linked Data Platform that can be used, extended, and deployed easily by organizations that want to publish Linked Data or build custom applications on Linked Data [1]. It provides:
a) a read-write Linked Data server for the Java EE stack
b) a custom triple store built on top of an RDBMS, with transactions, versioning, and rule-based reasoning support
c) pluggable RDF triple stores based on Eclipse RDF4J
d) LDP, SPARQL, and LD Path querying
e) transparent Linked Data caching
f) integrated basic security mechanisms
Warning
This project is now retired and is no longer supported or developed.
6.1.9.3.2. Details¶
Functionality:
Open (Linked) Data Platform
Interface: REST-style API and SPARQL endpoint supported.
Architecture: The architecture comprises the following tiers:
User Interface Layer. It mostly consists of admin and development interfaces and is not intended for end users.
Web-service Layer. It offers REST web-services to access most of the server functionality.
Service Layer. It offers CDI services to develop custom Java applications.
Model Layer. It offers persistence and data access functionality.
Persistence Layer. It is outside the Apache Marmotta Platform, which can use a number of Open Source database systems.
Deployment Model: It is available both as an on-premises and cloud-based solution. Docker based deployment is supported.
Requirements:
Hardware requirements. It requires a standard workstation, 1 GB main memory, and about 100 MB hard disk.
Software requirements. It is implemented as a Java Web Application that can, in principle, be deployed to any Java Application Container. It has been tested under Jetty 6.x and Tomcat 7.x. It requires Java JDK 6 or higher, Java Application Server (Tomcat 7.x or Jetty 6.x), and a database (PostgreSQL, MySQL). If not explicitly configured, an embedded H2 database will be used.
License model. Apache Software Licence (v. 2.0).
Databases: It supports the following triple store backends: (a) KiWi Triple Store, (b) Sesame Native, and (c) BigData triple store. The default backend is the KiWi triple store, which stores all data in a relational database, and it is the only option that supports reasoning and versioning.
Ontology Language: OWL serialized as RDF/RDFS triples.
Programming Language: Java.
6.1.9.4. European Data Portal (Open Data Platform Tool)¶
6.1.9.4.1. Overview¶
The European Data Portal (EDP) is an initiative of the Publications Office of the European Union and the European Commission that aims to increase the impact of open data by making it easy for everyone to find and re-use.
It uses only open source software, with extensions that are all publicly available for re-use.
As a core component, the CKAN open data portal software with the DCAT-AP RDF extension is used.
It allows sharing various data formats, e.g. tabular data and RDF data (such as ontologies), combining relational and semantic technologies.
The Virtuoso triple store database is used for storing ontologies.
For metadata in relational format, the PostgreSQL database is used as part of CKAN.
6.1.9.4.2. Details¶
Functionality:
Open Data Portal
Interface: REST-style API, SPARQL endpoint supported (see the example at the end of this section).
Architecture:
CKAN manages and provides metadata content (datasets) in a central repository.
DRUPAL provides the Portal’s Home Page with editorial content (e.g. Portal’s objectives, articles, news, events, tweets, etc.) and links to an Adapt Framework based training platform.
The CKAN metadata is replicated into a Virtuoso triple store database via a CKAN synchronisation extension, in order to ensure that both repositories have the same set of metadata.
The SPARQL Manager component allows the user to enter and run SPARQL queries on the Virtuoso linked data repository.
The portal uses the SOLR search engine to search separately for editorial content in DRUPAL and for datasets in the CKAN repository.
The Harvester is a separate component that is able to harvest data from multiple data sources with different formats and APIs.
Deployment model: It is available both as an on-premises and cloud-based solution.
Requirements:
The setup of the EDP consists of 20 virtual servers per computer room and environment (PROD, TEST)
Databases: PostgreSQL RDBMS for CKAN catalogue, Virtuoso for RDF data
Ontology Language: RDF, RDFS, OWL 2
Programming Language: Python (CKAN), PHP (Drupal)
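Since the EDP core is CKAN, datasets can be retrieved through the generic CKAN action API, as in the following sketch. The base URL is a placeholder; consult the portal's documentation for the live endpoint, which may differ.

```python
import requests

BASE = "https://example-open-data-portal.org"  # placeholder CKAN-based portal

# CKAN action API: full-text search over the dataset catalogue.
resp = requests.get(
    f"{BASE}/api/3/action/package_search",
    params={"q": "ontology", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

# CKAN wraps results in {"success": ..., "result": {"results": [...]}}.
for dataset in resp.json()["result"]["results"]:
    print(dataset["name"], dataset.get("title"))
```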
6.1.10. Conclusions¶
Determining which infrastructure to rely on for service terminologies and ontologies is a complex issue.
This FAIR Cookbook recipe gave an overview of non-functional criteria to take into consideration when appraising a software solution.
To complement this recipe, reading the following chapter is highly encouraged.
6.1.10.1. What to read next?¶
FAIRsharing records appearing in this recipe:
- European Data Portal (EDP)
- OBO Foundry (OBO)
- Ontology Lookup Service (OLS)
- Resource Description Framework (RDF)
- Resource Description Framework Schema (RDFS)
- Shapes Constraint Language (SHACL) (SHACL)
- Simple Knowledge Organization System (SKOS)
- Simple Protocol and RDF Query Language Overview (SPARQL)
- The FAIR Principles (FAIR)
- Web Ontology Language (OWL)
6.1.11. References¶
- [1] Apache Marmotta. 2018. URL: https://marmotta.apache.org/index.html.
- [2] Alistair Miles and Sean Bechhofer. SKOS Simple Knowledge Organization System Reference. W3C Working Draft, 2008. URL: http://www.w3.org/TR/skos-reference.
6.1.12. Authors¶
Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution
---|---|---|---|---|---
 | | Boehringer-Ingelheim AG | | | Writing - Original Draft
 | | Boehringer-Ingelheim AG | | | Writing - Original Draft
 | | Heriot Watt University | | | Writing - Review & Editing
Karsten Quast | | Boehringer-Ingelheim AG | | | Writing - Review & Editing
 | | University of Oxford | | | Writing - Review & Editing