2. InChI and SMILES identifiers for chemical structures
2.1. Main Objectives
The main purpose of this recipe is:
To take an SDF file, validate the content for chemical inconsistencies, and generate InChIs, InChIKeys, and SMILES for each entry in the SDF file.
2.2. Creating InChI and SMILES identifiers for chemical structures
To run the scripts below, you need a Groovy installation. The Groovy scripts use version 2.7.1 of the Chemistry Development Kit (see 2). This library and its use in Groovy are explained further in the book Groovy Cheminformatics with the Chemistry Development Kit. See this git repository for more detailed usage instructions and for where to find the tools: https://github.com/FAIRplus/fairplus-sdf
2.2.1. Record validation
When generating InChIs, the InChI library (see 1) may return a success state that flags a problem with the compound record in the SDF file, such as WARNING or ERROR. The first script reports these issues:
groovy badRecords.groovy -f foo.sdf
The output may look like this:
Sulfinpyrazone Omitted undefined stereo WARNING
Isosorbide mononitrate Charges were rearranged WARNING
Compound52 Proton(s) added/removed WARNING
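For a large SDF file, a quick tally of the reported states can show how much curation is needed. The following is a minimal Python sketch, not part of the recipe's tooling. It assumes that the report prints one record per line and that each line ends with the success state. The helper name `count_states` is hypothetical.

```python
from collections import Counter

# Success states named in the text; any other trailing token is ignored.
STATES = ("WARNING", "ERROR")


def count_states(report_text):
    """Count report lines by the success state that ends each line.

    Assumption: one record per line, with the state as the last token.
    """
    counts = Counter()
    for line in report_text.splitlines():
        tokens = line.split()
        if tokens and tokens[-1] in STATES:
            counts[tokens[-1]] += 1
    return counts


sample = """Sulfinpyrazone Omitted undefined stereo WARNING
Isosorbide mononitrate Charges were rearranged WARNING
Compound52 Proton(s) added/removed WARNING"""

print(dict(count_states(sample)))
```

To use it on real output, the report could first be redirected to a file and then read into a string.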
2.2.2. Calculate InChIs
Similarly, InChIKeys can be generated:
groovy inchikeys.groovy -f foo.sdf
When the success state is ERROR, nothing is output for that entry.
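Before passing the generated keys on to other tools, it can be useful to check that each value has the layout of a standard InChIKey: 14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, and one final letter. The Python sketch below checks only this layout. It does not verify that a key belongs to a particular structure, and the helper name `looks_like_inchikey` is hypothetical.

```python
import re

# Layout of an InChIKey: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text):
    """Return True if the text has the 27-character InChIKey layout."""
    return INCHIKEY_PATTERN.fullmatch(text.strip()) is not None


# The first value is the InChIKey of ethanol; the second is a truncated copy.
print(looks_like_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
print(looks_like_inchikey("LFQSCWFLJHTTHZ-UHFFFAOYSA"))
```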
2.2.3. Calculate SMILES strings
The last script calculates a SMILES for each entry in the SDF file:
groovy smiles.groovy -f foo.sdf
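Because the scripts work entry by entry, it can help to know how many entries the SDF file holds, for example to compare against the number of output lines. In an SDF file each record ends with a line containing `$$$$`, and the first line of a record is its title. The Python sketch below lists the titles under that assumption. The helper name `record_titles` is hypothetical, and the sample records are shortened placeholders without atom blocks.

```python
def record_titles(sdf_text):
    """Return the title (first line) of every record closed by '$$$$'."""
    titles = []
    current = []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            # End of a record: its first line is the title line.
            if current:
                titles.append(current[0].strip())
            current = []
        else:
            current.append(line)
    return titles


# Shortened placeholder records; real records also carry atom and bond blocks.
sample = (
    "ethanol\n  sketch\n\nM  END\n$$$$\n"
    "water\n  sketch\n\nM  END\n$$$$\n"
)

titles = record_titles(sample)
print(len(titles), titles)
```

For a real file, the text could be read with `open("foo.sdf").read()` and passed to the same function.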
This recipe explained how to validate the chemical structures in an SDF file and convert them to SMILES, InChI, and InChIKey. The latter can then be used with BridgeDb and its metabolite ID mapping databases to get additional identifiers.
2.3.1. What to read next?
Jonathan M. Goodman, Igor Pletnev, Paul Thiessen, Evan Bolton, and Stephen R. Heller. InChI version 1.06: now more than 99.99% reliable. Journal of Cheminformatics, May 24, 2021.
Egon Willighagen, John W. Mayfield, Jonathan Alvarsson, Arvid Berg, Lars Carlsson, Nina Jeliazkova, Stefan Kuhn, Tomáš Pluskal, Miquel Rojas-Chertó, Ola Spjuth, Gilleain Torrance, Chris T. Evelo, Rajarshi Guha, and Christoph Steinbeck. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. Journal of Cheminformatics, Jun 6, 2017.