# Introduction to the VCF2FHIR python library

<br/>
<br/>

````{panels_fairplus}
:identifier_text: FCB059
:identifier_link: 'https://w3id.org/faircookbook/FCB059'
:difficulty_level: 4
:recipe_type: hands_on
:reading_time_minutes: 30
:intended_audience: principal_investigator, data_manager, data_scientist
:maturity_level: 2
:maturity_indicator: 1, 2
:has_executable_code: yeah
:recipe_name: Using the VCF2FHIR python library
````


* author: philippe.rocca-serra@oerc.ox.ac.uk
* license: CC-BY-4.0
* version: 1.0
* creation-date: 2021.10.22


The `vcf2fhir` Python library by Dolin et al., 2021 [1], still at an early stage of development, provides an **initial capability** to convert genetic variation information stored in a standard [Variant Call Format (VCF)](http://samtools.github.io/hts-specs/VCFv4.3.pdf) file into a JSON-based HL7 FHIR message, compliant with `HL7 FHIR Genomics Report` guidelines.

This notebook offers a simple way for anyone interested in how FAIR principles can be connected to the clinical world to try it out for themselves.

* Main Features:
    - supports `simple variants` (SNVs, MNVs, Indels)

* Limitations:
    - **does not** support `structural variants`
    - is not intended for use in production systems



## Let's get going by importing all the necessary python libraries

In [201]:

import os
import json
import logging
import vcf2fhir


### The VCF2FHIR python library

The main entry point of `vcf2fhir` is the `Converter` class, which takes a number of arguments, most of which are optional.

* Required arguments:

- `vcf_filename (required)`: the path to a text-based or bgzipped VCF file.

    **IMPORTANT:**
     - Valid path and filename without whitespace must be provided.
     - VCF file must conform to VCF Version 4.1 or later.
     - FORMAT.GT must be present.
     - Multi-sample VCFs are allowed, but only the first sample will be converted.
     - bgzipped VCF files are allowed but then the additional argument `has_tabix` must be set to `True` and a tabix index file must be provided. The Tabix file must have the same name as the bgzipped VCF file, with a ‘.tbi’ extension, and must be in the same folder.


- `ref_build (required)`: Genome Reference Consortium genome assembly to which variants in the VCF were called.

    **IMPORTANT:**
    - Must be one of ‘GRCh37’ or ‘GRCh38’.
    
* Optional arguments:

- `patient_id (optional)`
- `conv_region_dict (optional)`
- `conv_region_filename (optional)`
- `annotation_filename (optional)`
- `region_studied_filename (optional)`
- `nocall_filename (optional)`
- `ratio_ad_dp (optional)(default value = 0.99)`
- `genomic_source_class (optional)(default value = somatic)`

For more information about those options, refer to the [library documentation](https://vcf2fhir.readthedocs.io/en/latest/API.html).
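The constraints listed above for `vcf_filename` can be checked before the converter is called. The following is a minimal sketch using only the Python standard library. The helper name `preflight_vcf` is hypothetical and not part of `vcf2fhir`. It assumes that bgzipped files carry a `.gz` extension, and it only inspects the header and the first data record, so it is a quick sanity check rather than a full validation.

```python
import gzip
import os
import re


def preflight_vcf(path):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if any(ch.isspace() for ch in path):
        problems.append("path contains whitespace")
    if not os.path.isfile(path):
        return problems + ["file not found"]

    # Assumption: a bgzipped VCF is recognised by its '.gz' extension.
    bgzipped = path.endswith(".gz")
    if bgzipped and not os.path.isfile(path + ".tbi"):
        problems.append("missing tabix index " + path + ".tbi")

    opener = gzip.open if bgzipped else open
    version = None
    has_gt = False
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##fileformat="):
                match = re.search(r"VCFv(\d+)\.(\d+)", line)
                if match:
                    version = (int(match.group(1)), int(match.group(2)))
            elif not line.startswith("#"):
                # First data record: FORMAT is the 9th tab-separated column.
                columns = line.rstrip("\n").split("\t")
                has_gt = len(columns) > 8 and "GT" in columns[8].split(":")
                break

    if version is None or version < (4, 1):
        problems.append("VCF version must be 4.1 or later")
    if not has_gt:
        problems.append("FORMAT.GT not found in the first data record")
    return problems
```

Calling `preflight_vcf('vcftests.vcf')` returns an empty list when these basic checks pass; any returned message points at a requirement worth fixing before running the conversion.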



#### Invoking the converter is as simple as the following command:

In [202]:
fhir = vcf2fhir.Converter('vcftests.vcf','GRCh37')

#### Invoking the `convert()` method to serialize the information as an HL7 FHIR JSON message to a default output file.

In [203]:
fhir.convert()

#### Performing both actions in one go while using an additional optional argument

In [204]:
vcf2fhir.Converter('vcftests.vcf','GRCh38', 'patient01').convert()

#### Invoking the conversion and writing to a user-defined file instead of the default file.

In [205]:
output = vcf2fhir.Converter('vcftests.vcf', 'GRCh37', 'patient01', ratio_ad_dp=0.89).convert(output_filename='patient01.json')

### Peeking at the resulting JSON file:

In [206]:
with open('patient01.json', 'r') as infile:
    fhirmsg = json.load(infile)

print(json.dumps(fhirmsg, indent=4, sort_keys=True))

{
    "category": [
        {
            "coding": [
                {
                    "code": "GE",
                    "system": "http://terminology.hl7.org/CodeSystem/v2-0074"
                }
            ]
        }
    ],
    "code": {
        "coding": [
            {
                "code": "81247-9",
                "display": "Master HL7 genetic variant reporting panel",
                "system": "http://loinc.org"
            }
        ]
    },
    "contained": [
        {
            "category": [
                {
                    "coding": [
                        {
                            "code": "laboratory",
                            "system": "http://terminology.hl7.org/CodeSystem/observation-category"
                        }
                    ]
                }
            ],
            "code": {
                "coding": [
                    {
                        "code": "69548-6",
                        "display": "Genetic variant assessm

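Because the message is plain JSON, the coded annotations it carries can be listed with a short recursive walk. The sketch below collects every entry found under a `coding` key, wherever it sits in the tree. The helper name `iter_codings` is hypothetical, and `sample` is a small hand-written excerpt that mirrors the top of the output above, not a complete FHIR message.

```python
def iter_codings(node):
    """Yield every entry of every "coding" list found anywhere in a JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "coding" and isinstance(value, list):
                for coding in value:
                    if isinstance(coding, dict):
                        yield coding
            else:
                yield from iter_codings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_codings(item)


# Hand-written excerpt mirroring the first lines of the output shown above.
sample = {
    "category": [
        {"coding": [{"code": "GE",
                     "system": "http://terminology.hl7.org/CodeSystem/v2-0074"}]}
    ],
    "code": {
        "coding": [{"code": "81247-9",
                    "display": "Master HL7 genetic variant reporting panel",
                    "system": "http://loinc.org"}]
    },
    "contained": [
        {"category": [
            {"coding": [{"code": "laboratory",
                         "system": "http://terminology.hl7.org/CodeSystem/observation-category"}]}
        ]}
    ],
}

# Keep only the codes drawn from the LOINC code system.
loinc_codes = [c["code"] for c in iter_codings(sample)
               if c.get("system") == "http://loinc.org"]
print(loinc_codes)  # ['81247-9']
```

Passing the `fhirmsg` dictionary loaded earlier to `iter_codings` in place of `sample` would list the codings of the full message in the same way.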
### Tracking conversion errors by activating the logger function

As with all conversions, things can go awry. It is therefore always good to log any error when executing code.
The authors of the `vcf2fhir` library provide two distinct loggers, which we'll now use.

The `vcf2fhir` logging process builds on the well-established Python `logging` library, so using it is as simple as using that library:

#### i. instantiate a logger and set a logging level

In [207]:
general_logger = logging.getLogger('vcf2fhir.general')
general_logger.setLevel(logging.DEBUG)

#### ii. define a file as output and a formatter pattern

In [208]:
# create file handler and set level to debug
ch = logging.FileHandler('vcf2fhir-generic-errors.log')
ch.setLevel(logging.DEBUG)
# create formatter
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
# add formatter to ch
ch.setFormatter(formatter)
# add ch to logger
general_logger.addHandler(ch)

### Using the dedicated `invalid_record_logger`:

#### i. create a logger and pass it the specific vcf2fhir logger as follows:

In [209]:
invalid_record_logger = logging.getLogger('vcf2fhir.invalidrecord')

#### ii. configure the logger output file, error logging level and output formatting 

In [210]:
inv_ch = logging.FileHandler('vcf2fhir-invalid-record-errors.log')
inv_ch.setLevel(logging.DEBUG)
inv_ch.setFormatter(formatter)


**Note:** we reuse the `formatter` created previously.

#### iii. plug the error handler into the logger and execute

In [211]:
invalid_record_logger.addHandler(inv_ch)
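In a notebook, re-running the cells above attaches a new handler each time, and every log record is then written once per handler. A possible way around this is to wrap the steps in a small helper that attaches a file handler only if the logger does not already have one for that file. This is a sketch built on the standard `logging` module; the helper name `attach_file_handler` is hypothetical and not part of `vcf2fhir`.

```python
import logging
import os

LOG_FORMAT = '%(asctime)s - %(name)s - %(levelname)s - %(message)s'


def attach_file_handler(logger_name, log_path, level=logging.DEBUG):
    """Attach one FileHandler for log_path to the named logger, at most once."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)

    # FileHandler stores the absolute path of its file in `baseFilename`.
    target = os.path.abspath(log_path)
    for handler in logger.handlers:
        if isinstance(handler, logging.FileHandler) and handler.baseFilename == target:
            return logger  # already configured, nothing to add

    handler = logging.FileHandler(log_path)
    handler.setLevel(level)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger
```

With this helper, the two loggers used in this recipe could be set up with `attach_file_handler('vcf2fhir.general', 'vcf2fhir-generic-errors.log')` and `attach_file_handler('vcf2fhir.invalidrecord', 'vcf2fhir-invalid-record-errors.log')`, and calling either line again leaves the configuration unchanged.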

### We can now read the log file to check what happened during the conversion from VCF to FHIR JSON.

This is an important quality control step, as `vcf2fhir` is still experimental and under active development.
Users of the tool therefore need to exercise critical thinking, understand the parsing and conversion rules, and understand the limits of the tool's operating envelope, as explained by the authors in their manuscript [1].
Note that a handler only captures records emitted after it has been attached, so the conversion must be run again once the loggers are configured.


In [212]:
with open('vcf2fhir-invalid-record-errors.log', 'r') as infile:
    lines = infile.readlines()

print(lines[0:10])


["2021-10-22 09:37:17,523 - vcf2fhir.invalidrecord - DEBUG - Reason: VCF INFO.SVTYPE must be in ['INS', 'DEL', 'DUP', 'CNV', 'INV']. Record: Record(CHROM=M, POS=11551, REF=T, ALT=[TN[M:16141[]), considered sample: CallData(GT=1, PS=None)\n", "2021-10-22 09:37:17,524 - vcf2fhir.invalidrecord - DEBUG - Reason: VCF INFO.SVTYPE must be in ['INS', 'DEL', 'DUP', 'CNV', 'INV']. Record: Record(CHROM=M, POS=11562, REF=T, ALT=[TN]11:49883566]]), considered sample: CallData(GT=1)\n", "2021-10-22 09:37:17,525 - vcf2fhir.invalidrecord - DEBUG - Reason: Mitochondrial DNA with GT = 0 or its diploid, Record: Record(CHROM=M, POS=6021, REF=A, ALT=[C]), considered sample: CallData(GT=0|1, PS=60003, DP=15, AD=['12', '3'], CGA_RDP=12)\n", "2021-10-22 09:37:17,526 - vcf2fhir.invalidrecord - DEBUG - Reason: Mitochondrial DNA with GT = 0 or its diploid, Record: Record(CHROM=M, POS=6027, REF=A, ALT=[C]), considered sample: CallData(GT=0|1, PS=60003, DP=17, AD=['13', '4'], CGA_RDP=13)\n", "2021-10-22 09:37:17,5

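Reading raw log lines quickly becomes tedious, so a summary of how many records were rejected for each reason can be more useful. The sketch below assumes the formatter pattern defined earlier and the `Reason: ... Record: ...` message layout visible in the output above, which may change in other versions of the library. The helper name `count_reasons` is hypothetical, and the sample lines are copied from that output.

```python
import re
from collections import Counter

# The reason text ends with either '.' or ',' just before ' Record: '.
REASON_PATTERN = re.compile(r" - Reason: (?P<reason>.*?)[.,] Record: ")


def count_reasons(log_lines):
    """Count how many log lines were emitted for each rejection reason."""
    counts = Counter()
    for line in log_lines:
        match = REASON_PATTERN.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts


# Three complete lines copied from the log output shown above.
sample_lines = [
    "2021-10-22 09:37:17,523 - vcf2fhir.invalidrecord - DEBUG - Reason: VCF INFO.SVTYPE must be in "
    "['INS', 'DEL', 'DUP', 'CNV', 'INV']. Record: Record(CHROM=M, POS=11551, REF=T, ALT=[TN[M:16141[]), "
    "considered sample: CallData(GT=1, PS=None)\n",
    "2021-10-22 09:37:17,525 - vcf2fhir.invalidrecord - DEBUG - Reason: Mitochondrial DNA with GT = 0 "
    "or its diploid, Record: Record(CHROM=M, POS=6021, REF=A, ALT=[C]), considered sample: "
    "CallData(GT=0|1, PS=60003, DP=15, AD=['12', '3'], CGA_RDP=12)\n",
    "2021-10-22 09:37:17,526 - vcf2fhir.invalidrecord - DEBUG - Reason: Mitochondrial DNA with GT = 0 "
    "or its diploid, Record: Record(CHROM=M, POS=6027, REF=A, ALT=[C]), considered sample: "
    "CallData(GT=0|1, PS=60003, DP=17, AD=['13', '4'], CGA_RDP=13)\n",
]

counts = count_reasons(sample_lines)
for reason, number in counts.most_common():
    print(number, reason)
```

Applied to the `lines` list read from the log file, `count_reasons(lines)` gives a quick overview of which kinds of records the converter skipped and how often.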
### Conclusion


With this notebook, we've shown how to convert genetic variation information held in a VCF formatted file (it must comply with v4.1 or higher for this conversion to work) and generate a JSON-based HL7 FHIR Genomics Report message.

#### Why does this matter and how does it relate to FAIR:

The conversion from VCF to an HL7 FHIR JSON message has to do with the **I** and **R** of FAIR, that is, interoperability and reusability.
From a syntactic standpoint, the availability of genetic variation information at a granular level in an easily parseable form (JSON) is a gain for anyone looking at merging this information with other clinical messages.
From a semantic standpoint, the reliance on the `LOINC` vocabulary to mark up the patterns defined in the HL7 FHIR Genomics Reports enhances interoperation between systems by providing unambiguous annotations.
Finally, as more systems are able to produce FHIR messages from a variety of instruments or data sources, the availability of a FHIR message covering a subset of genetic variation available from testing facilities makes investigating and mining phenotypic and genotypic relations more straightforward.

However, one needs to remember that the capability afforded by the `vcf2fhir` library is at an early stage and only supports simple cases. More effort is needed before the functionality is available at a Technology Readiness Level compatible with production systems.



### Reference:
````{dropdown} **Reference**
1. Dolin, R.H., Gothi, S.R., Boxwala, A. et al. vcf2fhir: a utility to convert VCF files into HL7 FHIR format for genomics-EHR integration. BMC Bioinformatics 22, 104 (2021). [https://doi.org/10.1186/s12859-021-04039-1](https://doi.org/10.1186/s12859-021-04039-1)
2. [https://github.com/elimuinformatics/vcf2fhir](https://github.com/elimuinformatics/vcf2fhir)
````


## Authors
````{authors_fairplus}
Philippe: Writing - Original Draft
Susanna: Writing - Review & Editing, Funding Acquisition
Danielle: Writing - Review & Editing
````

## License
````{license_fairplus}
CC-BY-4.0
````