This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Fragment achiral molecules in RDKit
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
This essay shows a simple way to fragment a non-chiral molecule (okay,
a "molecular graph") using RDKit, and then a complex way to re-connect
the SMILES fragment by manipulating the SMILES string at the syntax
level then recanonicalizing.
I'll start by fragmenting the methoxy group of
vanillin.
(I wrote an essay on
the
vanillin functional groups.) I want to turn the SMILES of
"c1(C=O)cc(OC)c(O)cc1" into two parts, '[*]OC' (the methoxy group) and
'O=Cc1cc[*]c(O)cc1' (the rest of the structure). The "[*]" terms are a way
to represent an R-group in SMILES.
The method I'll use is to identify the bond and its two end atoms,
delete the bond, and for each of the end atoms add a new "*" atom,
which is an atom with atomic number 0.
I'll identify the bond manually, by counting the atom indices and
identifying the substring which contains the methoxy:
1 ] atom indices, from 0 to 10.
0 1 2 34 56 7 8 90 ] (The top line is the tens.)
vanillin = c1(C=O)cc(OC)c(O)cc1
^^-- methoxy group
^-^ break this bond between atom[4] and atom[5]
I want to break the bond between atoms at 4 and 5. I'll use RDKit for
this. The default RDKit molecule does not support topological
modifications, so I'll have to turn the molecule into a "RWMol"
(Read/Write Molecule) to edit it, then convert the result back to a
normal molecule:
Visual inspection shows that this broke the correct bond, but it's
missing the new "[*]" atoms. I need to go back to the RWMol and add
two atoms with atomic number 0 (which Daylight SMILES terms a
'wildcard' atom):
That's getting closer! Now I need to connect each new atom to the
correct atom which was as the end of the deleted bond. Remember, the
old atom ids were 5 and 6, and you can see from the above code that th
new atom ids are 11 and 12. I'll connect atom 4 to 11 and atom 5 to
12, both using a single bond:
The above code looks right, but the real test is to reassemble the
fragments and see if it gives the same result. Rather than work with
the molecule at the molecular graph level, I'll manipulate the SMILES
string directly, using a ring closure method I described in 2004.
The term "ring closure" sounds like it's only used for rings, but it's
actually simply an alternate way to specify a bond. As Daylight's SMILES theory documentation points out, "C1.C1 specifies
the same molecule as CC(ethane)".
I'll replace the "[*]" symbols with the ring closure "%90", which is
high enough that won't occur in the structures I deal with. I can't do
a simple string substitution since that would turn '[*]OC' into
'%90OC'. If the '[*]' is the first atom, or is after a '.', then I
need to place the %90 after the second atom symbol, giving
"O%90C.c%901cc(C=O)ccc1O". I'll canonicalize that string, and
compare it to the canonicalized vanillin structure:
I apologize. I'm about to show you a big block of code with very
little commentary besides the comments. Feel free to skip to the next section.
In the previous section, I manually tranformed a SMILES string
containing wildcards to one with ring closures so I could test if the
fragmentation code was correct. I can automated that step, but it's a
bit complicated and I won't go into the details. The key observation
is that the wildcard atom is only connected by a single bond to the
rest of the fragment. If the wildcard is the first atom term in the
SMILES then the ring closure goes on the second atom term. Otherwise,
the ring closure goes on the atom immediately before the wildcard
atom. (The last is RDKit-specific becuse RDKit's canonical SMILES ends
up placing branches with wildcard atoms before any other branches.)
The following code automates that process. I call the file
"smiles_syntax.py" and it defines the new function
"convert_wildcards_to_closures(smiles, offsets=None)". Here's an
example:
By default each wildcard atom gets a new closure number, starting from
90, which is why you see the "%90" and "%91". The offsets parameter
specifies which offsets from 90 to use. With "(0, 0)", both
wildcards get the '%90' ring closure.
This will be used later on work with multiple cuts of the same structure.
Here's the content of "smiles_syntax.py".
# I call this "smiles_syntax.py"
from __future__ import print_function
import re
# Match a '*' in the different forms that might occur,
# including with directional single bonds inside of ()s.
_wildcard_regex = " |\n".join(re.escape(regex) for regex in
("*", "[*]", "(*)", "([*])", "(/*)", "(/[*])", "(\\*)", "(\\[*])"))
_wildcard_pattern = re.compile(_wildcard_regex, re.X)
# Match the SMILES for an atom, followed by its closures
_atom_pattern = re.compile(r"""
(
Cl? | # Cl and Br are part of the organic subset
Br? |
[NOSPFIbcnosp] | # as are these single-letter elements
\[[^]]*\] # everything else must be in []s
)
""", re.X)
def convert_wildcards_to_closures(smiles, offsets=None):
# This is designed for RDKit's canonical SMILES output. It does
# not handle all possible SMILES inputs.
if offsets is None:
# Use 0, 1, 2, ... up to the number of '*'s
offsets = range(smiles.count("*"))
closure_terms = []
for offset in offsets:
if not (0 <= offset <= 9):
raise ValueError("offset %d out of range (must be from 0 to 9)"
% (offset,))
closure_terms.append("%%%02d" % (90 + offset))
new_smiles = smiles
while 1:
# Find the first '*'. If none are left, stop.
wildcard_match = _wildcard_pattern.search(new_smiles)
if wildcard_match is None:
break
closure_term = closure_terms.pop(0)
wildcard_start = wildcard_match.start()
if wildcard_start == 0 or new_smiles[wildcard_start-1] == ".":
# At the start of the molecule or after a ".". Need to
# put the closure after the second atom. Find the second
# atom. Since we only ever break on single non-ring bonds,
# and since the first atom is a terminal atom, the second
# atom must either be immediately after the first atom, or
# there is a directional bond between them.
wildcard_end = wildcard_match.end()
second_atom_match = _atom_pattern.match(new_smiles, wildcard_end)
if second_atom_match is None:
# There was no atom. Is it a "/" or "\"? If so,
# we'll need to swap the direction when we move
# to a closure after the second atom.
bond_dir = new_smiles[wildcard_end:wildcard_end+1]
if bond_dir == "/":
direction = "\\"
elif bond_dir == "\\":
direction = "/"
else:
raise AssertionError(new_smiles)
# Look for the second atom, which must exist
second_atom_match = _atom_pattern.match(new_smiles, wildcard_end+1)
if second_atom_match is None:
raise AssertionError((new_smiles, new_smiles[match_end:]))
else:
direction = ""
second_atom_term = second_atom_match.group(1)
# I changed the bond configuration, so I may need to
# invert chirality of implicit chiral hydrogens.
if "@@H" in second_atom_term:
second_atom_term = second_atom_term.replace("@@H", "@H")
elif "@H" in second_atom_term:
second_atom_term = second_atom_term.replace("@H", "@@H")
# Reassemble the string with the wildcard term deleted and
# the new closure inserted directly after the second atom
# (and before any of its closures).
new_smiles = (new_smiles[:wildcard_start] + second_atom_term
+ direction + closure_term
+ new_smiles[second_atom_match.end():])
else:
# The match is somewhere inside of a molecule, so we attach
# assign the closure to the atom it's bonded to on the left
c = new_smiles[wildcard_start-1]
if c == "(" or c == ")":
# In principle, this could be something like "CCC(F)(Cl)[*]",
# where I would need to count the number of groups back to
# the main atom, and flip chirality accordingly. Thankfully,
# RDKit always puts the "[*]" terms immediately after the
# preceeding atom, so I don't need to worry.
raise NotImplementedError("intermediate groups not supported",
new_smiles, new_smiles[wildcard_start-1:])
elif c in "CNcnOS]Pos0123456789ABDEFGHIJKLMQRTUVWXYZabdefghijklmpqrtuvwxyz":
# Double-check the the previous character looks like part of an atom.
wildcard_term = wildcard_match.group()
# Preserve the direction, if present
if "/" in wildcard_term:
direction = "/"
elif "\\" in wildcard_term:
direction = "\\"
else:
direction = ""
new_smiles = (new_smiles[:wildcard_start] + direction + closure_term
+ new_smiles[wildcard_match.end():])
else:
raise AssertionError((new_smiles, c, new_smiles[wildcard_start-1:]))
return new_smiles
if __name__ == "__main__":
for smiles in ("*C", "*/CO.*CN", "C*.C(*)N"):
print(smiles, convert_wildcards_to_closures(smiles, (0,)*smiles.count("*")))
Test driver for fragment_simple()
While the above code looks right, it got that way only as the
result of a lot of testing. There are a lot of corner cases, like
deuterium and directional bonds, which caused the first versions of
this function to fail.
I've found it best to pass a lot of real-world SMILES through code
like this, to see if there are any problem. I usually draw from
PubChem SMILES or ChEMBL. I'll show you what that looks like.
NOTE: the file I test against, "pubchem.smi", excludes structures with
chiral terms. This is deliberate, for reasons I'll get to soon.
The "verify()" function, below, takes a molecule and a list of a
pairs. For each pair of atom describing a bond, it fragments the
molecule on that bond then reassembles it and tests that the result
matches the input.
from __future__ import print_function
from rdkit import Chem
from smiles_syntax import convert_wildcards_to_closures
def fragment_simple(mol, atom1, atom2):
rwmol = Chem.RWMol(mol)
rwmol.RemoveBond(atom1, atom2)
wildcard1 = rwmol.AddAtom(Chem.Atom(0))
wildcard2 = rwmol.AddAtom(Chem.Atom(0))
rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
return rwmol.GetMol()
def verify(mol, bond_atoms):
# Even though I'm not checking for stereochemistry, some of
# my input files contain isotopes. If I don't use isomeric
# SMILES then the input '[2H]' gets represented as '[H]',
# which during recanonicalization is placed on the main atom.
# (RDKit 2016 added an an option to disable that.)
expected_smiles = Chem.MolToSmiles(mol, isomericSmiles=True)
for begin_atom, end_atom in bond_atoms:
fragmented_mol = fragment_simple(mol, begin_atom, end_atom)
fragmented_smiles = Chem.MolToSmiles(fragmented_mol, isomericSmiles=True)
assert "." in fragmented_smiles, fragmented_smiles # safety check
closure_smiles = convert_wildcards_to_closures(fragmented_smiles, (0, 0))
assert "%90" in closure_smiles, closure_smiles # safety check
closure_mol = Chem.MolFromSmiles(closure_smiles)
final_smiles = Chem.MolToSmiles(closure_mol, isomericSmiles=True)
if final_smiles != expected_smiles:
raise ValueError( (expected_smiles, begin_atom, end_atom, fragmented_smiles,
closure_smiles, final_smiles) )
I can run it manually, and also check what it does when I break one of
the ring bonds:
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> verify(vanillin, [(4, 5)])
>>> verify(vanillin, [(3, 4)])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 12, in verify
AssertionError: [*]COc1([*])cc(C=O)ccc1O
I can have it break all single bond which aren't in a ring and which
aren't connected to a wildcard atom (#0) or a hydrogen (#1):
Once I'm happy that the verification function is correct, I can put it
in a test driver:
def main():
# Match a single bond not in a ring
single_bond_pat = Chem.MolFromSmarts("[!#0;!#1]-!@[!#0;!#1]")
# NOTE! This file contains no chiral structures (nothing with a '@')
with open("pubchem.smi") as infile:
## You might try processing every 100th record
# import itertools
# infile = itertools.islice(infile, None, None, 100)
## Or process the lines in random order
# import random
# lines = infile.readlines()
# random.shuffle(lines)
# infile = lines
recno = -1
num_tests = 0
for recno, line in enumerate(infile):
if recno % 100 == 0:
print("Processing record:", recno, "#tests:", num_tests)
smiles, id = line.split()
if "*" in smiles:
print("Skipping record with a '*':", repr(line))
continue
if "@" in smiles:
print("Skipping record with a chiral '@':", repr(line))
continue
mol = Chem.MolFromSmiles(smiles)
if mol is None:
continue
matches = mol.GetSubstructMatches(single_bond_pat)
verify(mol, matches)
num_tests += len(matches)
recno += 1
print("Processed", recno, "records and", num_tests, "tests")
if __name__ == "__main__":
main()
That seems like a lot of test code just to show that the function,
"fragment_simple()", is correct. It's only seven lines long, without
any branches. It's the sort of funciton that should be easy to test by
eye or with a few test cases.
Which is what I did the first ten or so times I wrote that
function. But then I found out that the function doesn't preserve
chirality. Here's an example of the verify() function failing on what
should be a simple test case:
>>> mol = Chem.MolFromSmiles("C[C@](N)(C=O)c1ccccc1")
>>> verify(mol, [(1, 3)])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 21, in verify
ValueError: ('C[C@](N)(C=O)c1ccccc1', 1, 3, '[*]C=O.[*][C@@](C)(N)c1ccccc1',
'C%90=O.[C@@]%90(C)(N)c1ccccc1', 'C[C@@](N)(C=O)c1ccccc1')
You can see the problem in the fragmentation step, where the chirality
somehow inverted:
The second fragment should have "[C@]" instead of "[C@@]". Why did it flip?
The problem comes back to the "RemoveBond()/AddBond()" in
fragment_simple(). RDKit stores the chirality as an atom property, but
that property is actually based on the permutation order of the
bonds. If the permutation order changes, then the chirality changes
even if the chirality property hasn't changed.
To demonstrate, I need a helper function to show the connection table:
_bond_symbol = {
Chem.BondType.SINGLE: "-",
Chem.BondType.DOUBLE: "=",
Chem.BondType.TRIPLE: "#",
Chem.BondType.AROMATIC: ":",
}
def get_bond_info(from_atom_idx, bond):
to_atom_idx = bond.GetBeginAtomIdx()
if from_atom_idx == to_atom_idx:
to_atom_idx = bond.GetEndAtomIdx()
return _bond_symbol[bond.GetBondType()] + str(to_atom_idx)
# Example output line:
# 5 C -1 :6 :10
# This says that atom[5] is a "C"arbon with 3 bond.
# The first bond is a single bond to atom[1].
# The second bond is an aromatic bond to atom[6].
# The third bond is an aromatic bond to atom[10].
def show_connection_table(mol):
for atom in mol.GetAtoms():
bonds = " ".join(get_bond_info(atom.GetIdx(), bond) for bond in atom.GetBonds())
print(atom.GetIdx(), atom.GetSymbol(), bonds)
then use it to show the connection table for the two molecules.
>>> show_connection_table(mol)
0 C -1
1 C -0 -2 -3 -5
2 N -1
3 C -1 =4
4 O =3
5 C -1 :6 :10
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
>>> show_connection_table(new_mol)
0 C -1
1 C -0 -2 -5 -11
2 N -1
3 C =4 -12
4 O =3
5 C -1 :6 :10
6 C :5 :7
7 C :6 :8
8 C :7 :9
9 C :8 :10
10 C :9 :5
11 * -1
12 * -3
I deleted the bond from atom[1] to atom[3]. It's atom[1] which has the
chiral property so look at the second line of each of the outputs.
The original bond configuration order was (0, 2, 3, 5), while now it's
(0, 2, 5, 11), where the third bond which went to the old atom[5] was
replaced with the new fourth bond to the new wildcard atom[11]. The
permutation order changed. The third and fourth bonds must be swapped
to get back to the original permutation order. Each flip in the
permutation order corresponds to a chirality inversion, which is why
the output chirality is inverted.
Stay tuned to this channel for the continuing story of how to fragment
a molecule and preserve chirality after reassembly.