17.12.1. Introduction

If you have ever done a literature search for a project, you will appreciate the effort it takes to work through the ever-expanding volume of texts and to organize the extracted information. Over the last decades, noticeable progress has been made in using AI to automate this process. Modern machine learning approaches aim to identify, extract, and store important information from unstructured texts. To make the extracted metadata active and FAIR, one often stores it in the form of a knowledge graph.

The information extraction pipeline can be viewed as a sequence of steps:

  • Collecting the text data.

  • Avoiding ambiguity of entities with co-reference resolution.

  • Entity recognition and named entity linking.

  • Relationship extraction.

  • Storing the data as a knowledge graph.

Collecting the text data

First, one collects the text to extract the data from. The text may be a collection of internal documents, articles, online content, or picture descriptions produced by image-to-text algorithms.

Here, as an example, we will collect a dataset of article abstracts on the topic “cardiac amyloidosis”. In the biological domain, articles can be collected from the PubMed database using biopython. For the sake of simplicity, we will only go through the first 20 articles that come up in the search.

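!pip install biopython
# Import libraries
from Bio import Entrez

# NCBI asks for a contact address with each request; this one is a placeholder
Entrez.email = 'your.name@example.org'

def search(query, max_papers=20):
    """Get IDs of papers on the given topic from the PubMed database."""
    # Assumed arguments: the search term and the maximum number of hits
    handle = Entrez.esearch(db='pubmed',
                            term=query,
                            retmax=max_papers)
    results = Entrez.read(handle)
    return results

def fetch_details(id_list):
    """Get details on each paper (including the abstract)."""
    ids = ','.join(id_list)
    # Assumed arguments: the comma-separated IDs, returned as XML for Entrez.read
    handle = Entrez.efetch(db='pubmed',
                           id=ids,
                           retmode='xml')
    results = Entrez.read(handle)
    return results

results = search('cardiac amyloidosis')
id_list = results['IdList']
papers = fetch_details(id_list)

Avoiding ambiguity of entities with coreference resolution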

The prepared text should then go through a coreference resolution model. In a nutshell, this step replaces ambiguous words in a sentence so that the sentence can be understood without extra context. For example, personal pronouns are replaced with the name of the person they refer to. A number of approaches exist to perform the task; one of the more recently developed is crosslingual-coreference from the spaCy universe. spaCy is a Python library that provides an easy way to create pipelines for natural language processing.
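
To make the idea concrete, here is a toy sketch of the substitution step only. It assumes a coreference model has already grouped the mentions into clusters of character spans; the function name `resolve_mentions` and the hand-written cluster are hypothetical and are not part of the crosslingual-coreference API.

```python
def resolve_mentions(text, clusters):
    """Replace every later mention in a cluster with the cluster's first mention.

    `clusters` is a list of clusters; each cluster is a list of (start, end)
    character spans ordered by position, the first span being the
    representative mention.
    """
    replacements = []
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head_text = text[head_start:head_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, head_text))

    # Apply the replacements from right to left so earlier offsets stay valid.
    for start, end, head_text in sorted(replacements, reverse=True):
        text = text[:start] + head_text + text[end:]
    return text


text = "High fever is very dangerous. It can be treated with paracetamol."
# Hand-written cluster: "High fever" and "It" refer to the same thing.
it_start = text.index("It")
clusters = [[(0, len("High fever")), (it_start, it_start + len("It"))]]
print(resolve_mentions(text, clusters))
```

A real model has to find the clusters itself, which is the hard part; the substitution shown here is only the final, mechanical step.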

!pip install crosslingual-coreference==0.2.3 spacy-transformers==1.1.5 wikipedia neo4j
!pip install --upgrade google-cloud-storage
!pip install transformers==4.18.0
!python -m spacy download en_core_web_sm
import spacy
import crosslingual_coreference

# Configure the `Device` parameter:
DEVICE = -1 # Number of the GPU, -1 if want to use CPU

# Add coreference resolution model:
coref = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
coref.add_pipe("xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})

Entity recognition and named entity linking

The next step is known as named entity recognition (NER). Here, we want to extract all important entities from the sentences. Depending on the use case, one may need to train a model to recognize entities of a specific type. For example, this tutorial details a way to train a model to recognize some entities from a biomedical domain. However, the spaCy universe also provides some pre-trained models to recognize entities, which we are going to use in our example.

Then, one needs to standardize the entities and map them to an existing ontology. The process is known as entity linking. Here, we map entities from the text to corresponding unique identifiers from a target knowledge base, for example, Wikipedia. One can also use databases relevant to the specific topic of the texts. We will try to map our entities to the NCI Thesaurus (NCIT), choosing the first match as the mapping for simplicity.

Note that taking the first match is not always the best choice; one can use different similarity metrics to identify the best-matching term in the ontology.
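
As a minimal sketch of such a metric, the snippet below ranks candidate labels by character-level similarity using Python's standard `difflib` module. The helper name `best_match` and the candidate labels are made up for illustration; in practice the candidates would be the labels returned by an ontology search.

```python
from difflib import SequenceMatcher


def best_match(mention, candidate_labels):
    """Return the candidate label most similar to the mention (case-insensitive)."""
    def similarity(label):
        return SequenceMatcher(None, mention.lower(), label.lower()).ratio()

    return max(candidate_labels, key=similarity)


# Made-up candidate labels standing in for the hits of an ontology search.
candidates = ["Fever Measurement", "Fever", "Hay Fever"]
print(best_match("fever", candidates))  # Fever
```

Character-level similarity cannot recognize synonyms (for example, paracetamol and acetaminophen), so it is only one of several signals a more careful mapping could combine.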

A mapping to the Wikipedia terms is performed in this tutorial.

Relationship Extraction

After entity linking, we extract the relationships between the identified entities to obtain standardized (subject, relation, object) triples for a knowledge graph. The Rebel project, which is also available as a spaCy component, extracts both entities and relations in one step, so we can use it in our pipeline.
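
Rebel's text-to-text model writes its answer as a single string in which special tokens mark the head entity (`<triplet>`), the tail entity (`<subj>`), and the relation (`<obj>`). The sketch below illustrates that layout with a compact regex-based parser; the helper `parse_linearised` is hypothetical, and the input string is hand-written for illustration rather than real model output.

```python
import re


def parse_linearised(text):
    """Parse '<triplet> head <subj> tail <obj> relation' strings into triples."""
    triples = []
    # Drop sentence and padding tokens added by the tokenizer.
    text = re.sub(r"</?s>|<pad>", "", text)
    for chunk in text.split("<triplet>")[1:]:
        head, _, rest = chunk.partition("<subj>")
        # One head may be followed by several '<subj> tail <obj> relation' groups.
        for tail, relation in re.findall(r"(.*?)<obj>(.*?)(?:<subj>|$)", rest):
            triples.append((head.strip(), relation.strip(), tail.strip()))
    return triples


example = "<s><triplet> paracetamol <subj> High fever <obj> medical condition treated</s>"
print(parse_linearised(example))
# [('paracetamol', 'medical condition treated', 'High fever')]
```

Each result is ordered as (head, relation, tail), which corresponds to the (subject, relation, object) triple stored in the graph.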

To link the entities to NCIT, we can rewrite the set_annotations function from Rebel as specified here and turn the call_wiki_api function into a call_ncit_api function:

# Add rebel component https://github.com/Babelscape/rebel/blob/main/spacy_component.py
import requests
import re
import hashlib
from spacy import Language
from typing import List
import pandas as pd

from spacy.tokens import Doc, Span

from transformers import pipeline

def call_ncit_api(item):
    try:
        url = f"https://www.ebi.ac.uk/ols/api/search?q={item}&ontology=ncit"
        data = pd.DataFrame(requests.get(url).json().get('response').get('docs'))
        # Return the label of the first hit (a simplistic, imperfect way of mapping)
        return data["label"][0]
    except Exception:
        # No hit was found or the request failed
        return 'id-less'

def extract_triplets(text):
    """Function to parse the generated text and extract the triplets"""
    triplets = []
    subject, relation, object_ = '', '', ''
    text = text.strip()
    current = 'x'
    for token in text.replace("<s>", "").replace("<pad>", "").replace("</s>", "").split():
        if token == "<triplet>":
            current = 't'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
                relation = ''
            subject = ''
        elif token == "<subj>":
            current = 's'
            if relation != '':
                triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})
            object_ = ''
        elif token == "<obj>":
            current = 'o'
            relation = ''
        else:
            # Ordinary token: append it to whichever field is currently open
            if current == 't':
                subject += ' ' + token
            elif current == 's':
                object_ += ' ' + token
            elif current == 'o':
                relation += ' ' + token
    if subject != '' and relation != '' and object_ != '':
        triplets.append({'head': subject.strip(), 'type': relation.strip(),'tail': object_.strip()})

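    return triplets

# Register the component with spaCy under the name "rebel"
@Language.factory(
    "rebel",
    requires=["doc.sents"],
    assigns=["doc._.rel"],
    default_config={
        "model_name": "Babelscape/rebel-large",
        "device": 0,
    },
)
class RebelComponent:
    def __init__(
        self,
        nlp,
        name,
        model_name: str,
        device: int,
    ):
        assert model_name is not None, ""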
        self.triplet_extractor = pipeline("text2text-generation", model=model_name, tokenizer=model_name, device=device)
        self.entity_mapping = {}
        # Register custom extension on the Doc
        if not Doc.has_extension("rel"):
          Doc.set_extension("rel", default={})

    def get_wiki_id(self, item: str):
        #mapping = self.entity_mapping.get(item)
        #if mapping:
        #  return mapping
        res = call_ncit_api(item)
        self.entity_mapping[item] = res
        return res

    def _generate_triplets(self, sent: Span) -> List[dict]:
          output_ids = self.triplet_extractor(sent.text, return_tensors=True, return_text=False)[0]["generated_token_ids"]["output_ids"]
          extracted_text = self.triplet_extractor.tokenizer.batch_decode(output_ids[0])
          extracted_triplets = extract_triplets(extracted_text[0])
          return extracted_triplets

    def set_annotations(self, doc: Doc, triplets: List[dict]):
        for triplet in triplets:

            # Remove self-loops (relationships that start and end at the same entity)
            if triplet['head'] == triplet['tail']:
                continue

            # Use regex to search for entities; escape them so that characters
            # such as parentheses are matched literally
            head_span = re.search(re.escape(triplet["head"]), doc.text)
            tail_span = re.search(re.escape(triplet["tail"]), doc.text)

            # Skip the relation if either the head or the tail entity is not present in the text
            # Sometimes the Rebel model hallucinates some entities
            if not head_span or not tail_span:
                continue
            index = hashlib.sha1("".join([triplet['head'], triplet['tail'], triplet['type']]).encode('utf-8')).hexdigest()
            if index not in doc._.rel:
                # Get wiki ids and store results
                doc._.rel[index] = {"relation": triplet["type"], "head_span": {'text': triplet['head'], 'id': self.get_wiki_id(triplet['head'])}, "tail_span": {'text': triplet['tail'], 'id': self.get_wiki_id(triplet['tail'])}}

    def __call__(self, doc: Doc) -> Doc:
        for sent in doc.sents:
            sentence_triplets = self._generate_triplets(sent)
            self.set_annotations(doc, sentence_triplets)
        return doc
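
The `set_annotations` method above keys each relation by a SHA-1 digest of its head, tail, and relation type, so a triple that is extracted from several sentences is stored only once. A small standalone sketch of that idea follows; the helper name `triple_key` and the sample triples are ours, for illustration only.

```python
import hashlib


def triple_key(head, tail, relation):
    """Build a stable identifier for a triple from its three text fields."""
    return hashlib.sha1("".join([head, tail, relation]).encode("utf-8")).hexdigest()


store = {}
triples = [
    ("paracetamol", "High fever", "medical condition treated"),
    ("paracetamol", "High fever", "medical condition treated"),  # repeated triple
    ("High fever", "paracetamol", "drug used for treatment"),
]
for head, tail, relation in triples:
    key = triple_key(head, tail, relation)
    if key not in store:
        store[key] = {"head": head, "relation": relation, "tail": tail}

print(len(store))  # 2
```

Because the three fields are concatenated without a separator, two different triples could in principle produce the same key; joining them with a character that cannot occur in the text would rule that out.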

After redefining the Rebel spaCy component, we include it into our pipeline:

DEVICE = -1 # Number of the GPU, -1 if want to use CPU

# Add coreference resolution model
coref = spacy.load('en_core_web_sm', disable=['ner', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer'])
coref.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": DEVICE})

# Define rel extraction model
rel_ext = spacy.load('en_core_web_sm', disable=['ner', 'lemmatizer', 'attribute_ruler', 'tagger'])
rel_ext.add_pipe("rebel", config={
    'device':DEVICE, # Number of the GPU, -1 if want to use CPU
    'model_name':'Babelscape/rebel-large'} # Model used, will default to 'Babelscape/rebel-large' if not given
    )

Now we can test the pipeline on two simple sentences:

input_text = "High fever is very dangerous. It can be treated with paracetamol."

coref_text = coref(input_text)._.resolved_text

doc = rel_ext(coref_text)

for value, rel_dict in doc._.rel.items():
    print(f"{value}: {rel_dict}")

#0440ea848947d2677bc11443f99f20f67ce0a1bc: {'relation': 'subclass of', 'head_span': {'text': 'High fever', 'id': 'High Grade Fever'}, 'tail_span': {'text': 'dangerous', 'id': 'DRRI-2 - A: Dangerous Military Duties'}}
#8aa25d264897bd007d389890b2239c2b9c07fa0b: {'relation': 'drug used for treatment', 'head_span': {'text': 'High fever', 'id': 'High Grade Fever'}, 'tail_span': {'text': 'paracetamol', 'id': 'Acetaminophen Measurement'}}
#d91bef9bfc94439523675b5d6a62e1f4635c0cdd: {'relation': 'medical condition treated', 'head_span': {'text': 'paracetamol', 'id': 'Acetaminophen Measurement'}, 'tail_span': {'text': 'High fever', 'id': 'High Grade Fever'}}

You can see that in the coreference step the pronoun “it” in the second sentence was replaced by the unambiguous entity “High fever”. After that, the Rebel model extracted the (subject, relation, object) triples and the entities were mapped to NCIT terms. Note that the mapping here is far from perfect. For example, the entity ‘dangerous’ was mapped to ‘DRRI-2 - A: Dangerous Military Duties’ in NCIT. This is because, for simplicity, our mapping procedure takes the first result returned for the term in the NCIT database. To improve this, one would need to develop a more complex mapping algorithm.

Storing the results

The final (subject, relation, object) triples can be stored either as a labeled property graph or as an RDF graph. Guidelines for storing the results as a Neo4j labeled property graph are given here. Here, we show how to store the results as an RDF graph using the rdflib library in Python.

rdflib allows the creation of entities with known URIs through the URIRef class. One can also create a custom namespace with new entities and relations.

!pip install rdflib
from urllib.parse import quote

from rdflib import Graph
from rdflib import URIRef, BNode, Literal, Namespace

def Capitalise_underscore(relation):
  return relation.capitalize().replace(' ','_')

def ncit_iri(item):
    try:
        url = f"https://www.ebi.ac.uk/ols/api/search?q={item}&ontology=ncit"
        data = pd.DataFrame(requests.get(url).json().get('response').get('docs'))
        # Return the IRI of the first hit
        return data["iri"][0]
    except Exception:
        # No hit was found or the request failed
        return 'id-less'

EX = Namespace('http://example.org/')

g = Graph()
relations = pd.DataFrame()

results = search('cardiac amyloidosis')
id_list = results['IdList']
papers = fetch_details(id_list)
for i, paper in enumerate(papers['PubmedArticle']):
    article = paper['MedlineCitation']['Article']
    print("{}) {}".format(i+1, article['ArticleTitle']))
    # Some records have no abstract; skip them
    abstract_parts = article.get('Abstract', {}).get('AbstractText', [])
    if not abstract_parts:
        continue
    abstract_text = ' '.join(str(part) for part in abstract_parts)
    coref_text = coref(abstract_text)._.resolved_text
    doc = rel_ext(coref_text)
    # Nothing to store if no relation was extracted from this abstract
    if not doc._.rel:
        continue
    for value, rel_dict in doc._.rel.items():
      subject_iri = ncit_iri(rel_dict['head_span']['text'])
      object_iri = ncit_iri(rel_dict['tail_span']['text'])
      # Use the NCIT IRI when one was found; otherwise create a term in our
      # own namespace (percent-encoded, since IRIs must not contain spaces)
      if subject_iri != 'id-less':
        subj = URIRef(subject_iri)
      else:
        subj = EX[quote(rel_dict['head_span']['text'])]
      if object_iri != 'id-less':
        obj = URIRef(object_iri)
      else:
        obj = EX[quote(rel_dict['tail_span']['text'])]
      pred = EX[Capitalise_underscore(rel_dict['relation'])]
      g.add((subj, pred, obj))
    df = pd.DataFrame.from_dict(doc._.rel).transpose()
    df['subject_text'] = df.head_span.apply(lambda x: x['text'])
    df['subject_id'] = df.head_span.apply(lambda x: x['id'])
    df['object_text'] = df.tail_span.apply(lambda x: x['text'])
    df['object_id'] = df.tail_span.apply(lambda x: x['id'])

    df = df.drop(["head_span", "tail_span"], axis = 1)

    relations = pd.concat([relations, df])

Finally, we can visualize the resulting graph and export it in .ttl format.

print(g.serialize(format = 'ttl'))

Visualization of the graph:

import networkx as ntx
import matplotlib.pyplot as plot

graph = ntx.from_pandas_edgelist(relations, "subject_text", "object_text", edge_attr=True, create_using=ntx.MultiDiGraph())

plot.figure(figsize=(10, 10))
posn = ntx.spring_layout(graph)
ntx.draw(graph, with_labels=True, node_color='green', edge_cmap=plot.cm.Blues, pos = posn)