This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Use FragmentOnBonds to fragment a molecule in RDKit
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Those were mostly pedagogical. They describe the low-level details of
how to cut a bond to fragment a molecule into two parts, and still
maintain correct chirality.
If you are writing production code, you likely shouldn't use any of
that code, because the built-in RDKit function FragmentOnBonds()
does the same thing.
FragmentOnBonds
FragmentOnBonds() will fragment a molecule given a list of bond
identifiers. In default use it attaches a new "dummy" atom (what I've
been calling a wildcard atom) to the atoms at the end of each broken
bond. It will also set the isotope label for each newly created dummy
atom to the atom index of the atom to which it is attached.
>>> from rdkit import Chem
>>> bond_idx = vanillin.GetBondBetweenAtoms(4, 5)
>>> from rdkit import Chem
>>> vanillin = Chem.MolFromSmiles("c1(C=O)cc(OC)c(O)cc1")
>>> bond = vanillin.GetBondBetweenAtoms(4, 5)
>>> bond
<rdkit.Chem.rdchem.Bond object at 0x102bec8a0<
>>> bond.GetIdx()
4
>>> new_mol = Chem.FragmentOnBonds(vanillin, [bond.GetIdx()])
>>> Chem.MolToSmiles(new_mol, isomericSmiles=True)
'[4*]OC.[5*]c1cc(C=O)ccc1O'
I don't want a label, so I'll specify 0 for the atom labels. (An unlabeled atom has an isotope of zero.)
plus some investigation shows that the actual C++ code has the name
"fragmentOnBonds". The RDKit convention is to use an intial uppercase
letter for Python function, and an initial lowercase letter for C++
functions.
A search this time for "fragmentOnBonds" has a few more hits,
including this very suggestive one:
That second result is start of the FragmentOnBondsfunction
definition,. Here's the key part, with commentary.
The function can fragment multiple bonds. It iterates through the
bonds in order. For each one it gets the bond type and the begin and
end atoms, and if requested it stores information about the number of
cuts per atom.
for (unsigned int i = 0; i < bondsToRemove.size(); ++i) {
const Bond *bond = bondsToRemove[i];
unsigned int bidx = bond->getBeginAtomIdx();
unsigned int eidx = bond->getEndAtomIdx();
Bond::BondType bT = bond->getBondType();
res->removeBond(bidx, eidx);
if (nCutsPerAtom) {
(*nCutsPerAtom)[bidx] += 1;
(*nCutsPerAtom)[eidx] += 1;
}
FragmentOnBonds() by default will add "dummy" atoms (atoms with atomic
number 0), but this can be disabled. If enabled, then it will create
the two atoms with atomic number 0. By default it will set the istope
to be the index of the atom it will be attached to, but that can also
be specified with the dummyLabels parameter:
if (addDummies) {
Atom *at1, *at2;
at1 = new Atom(0);
at2 = new Atom(0);
if (dummyLabels) {
at1->setIsotope((*dummyLabels)[i].first);
at2->setIsotope((*dummyLabels)[i].second);
} else {
at1->setIsotope(bidx);
at2->setIsotope(eidx);
}
Next, make the bond from the old terminal atoms to the new wildcard
atoms. By default the bond type for the new bonds is the same as the
bond which was broken (determined earlier). Otherwise, there's an
option to specify an alternate bond type.
unsigned int idx1 = res->addAtom(at1, false, true);
if (bondTypes) bT = (*bondTypes)[i];
res->addBond(eidx, at1->getIdx(), bT);
unsigned int idx2 = res->addAtom(at2, false, true);
res->addBond(bidx, at2->getIdx(), bT);
The last part does what the comment says it does; if the atom has a
tetrahedral chirality then check if the permutation order has changed
and invert the chirality if needed:
// figure out if we need to change the stereo tags on the atoms:
if (mol.getAtomWithIdx(bidx)->getChiralTag() ==
Atom::CHI_TETRAHEDRAL_CCW ||
mol.getAtomWithIdx(bidx)->getChiralTag() ==
Atom::CHI_TETRAHEDRAL_CW) {
checkChiralityPostMove(mol, mol.getAtomWithIdx(bidx),
res->getAtomWithIdx(bidx),
mol.getBondBetweenAtoms(bidx, eidx));
}
if (mol.getAtomWithIdx(eidx)->getChiralTag() ==
Atom::CHI_TETRAHEDRAL_CCW ||
mol.getAtomWithIdx(eidx)->getChiralTag() ==
Atom::CHI_TETRAHEDRAL_CW) {
checkChiralityPostMove(mol, mol.getAtomWithIdx(eidx),
res->getAtomWithIdx(eidx),
mol.getBondBetweenAtoms(bidx, eidx));
}
The code to check if the chirality has changed iterates through all of
the bonds of the old atom ("oAt") to make a list ("newOrder")
containing all of the atom identifiers on the other side of the
bond. (The "OEDGE_ITER" is part of the adapter to work with Boost
Graph library. It's an "out_edge_iterator".)
void checkChiralityPostMove(const ROMol &mol, const Atom *oAt, Atom *nAt,
const Bond *bond) {
INT_LIST newOrder;
ROMol::OEDGE_ITER beg, end;
boost::tie(beg, end) = mol.getAtomBonds(oAt);
while (beg != end) {
const BOND_SPTR obond = mol[*beg];
++beg;
The code knows that the RemoveBond()/AddBond() caused the new dummy
atom to be placed at the end of bond list, so when it sees the old
bond it simply ignores it. Once it's gone through the old atoms, it
adds the new wildcard atom id to the end of the list. Interestingly,
this knowledge isn't something I can depend on, because it's an
implementation detail which might change in the future. That's why I
needed to substitute the new bond id at this point in my code.
if (obond.get() == bond) {
continue;
}
newOrder.push_back(obond->getIdx());
}
newOrder.push_back(bond->getIdx());
The last bit is to compute the permutation order, or as it says,
"perturbation order". I believe that is a typo. Dig down a bit and it
uses a selection sort to count the number of swaps needed to make the
"oAt" atom list and the "newOrder" list match. The result, modulo two,
gives the parity. If the parity is odd, invert the chirality:
unsigned int nSwaps = oAt->getPerturbationOrder(newOrder);
if (nSwaps % 2) nAt->invertChirality();
}
This was a bit different than my approach, which compared the parity
before and after, but it gives the same results.
Jumping back to the fragmentOnBonds() function, the last bit of the
function is:
This is needed because some of the computed properties may depend on
bond patterns which are no longer present. Note that it does not use
SanitizeMol().
I mentioned that checkChiralityPostMove() uses insider knowledge. It
knows that the deleted bond is removed from the list and the new bond
added at the end. It uses this information to build the "newOrder"
list with atom indices in the correct order, before determining the
difference in parity between that list and the list of atom indices
around the new bond.
There's an easier way. Count the number of bond between where the
deleted bond was and the end, and take the result modulo 2. For
example, if the transforamation were:
initial neighbors = [1, 6, 5, 4]
- remove bond to atom 5
neighbors after delete = [1, 6, 4]
- add bond to new atom 9
final neighbors = [1, 6, 4, 9]
then the bond to atom 5 was in the third position, with one bond (to
atom 4) between it and the end of the list. The parity is therefore
odd. I added this suggestion to the RDKit issue tracker as
#1033.
The proof is simple. Assume there are n elements afterward the
bond to delete. Then there are n swaps needed to bring it to
the end, where it can be replaced with the connection to the new
atom. Hence the parity is n%2.
SanitizeMol is slow
In the earlier essay in this series, "Fragment
chiral molecules in RDKit", I investigated some failure cases
where the chirality wasn't preserved. Greg
Landrum pointed out:after you break one or more bonds, you
really, really should re-sanitize the molecule (or at least call
ClearComputedProps() .... I found that if I added SanitizeMol()
then some of the error cases done by using ClearComputedProps() were
no longer error cases. This may be because of a bug in
FastFindRings. Greg, in the second link, writes: There is a
workaround until this is fixed; explicitly sanitize your molecules or
just cal GetSymmSSSR() before you generate SMILES.
So that's what I did. However, without SanitizeMol() there were only
232 failures out of 1.4M+ input structures. Adding SanitizeMol()
didn't fix all of the remaining errors. On the other hand, the
performance overhead to SanitizeMol() is quite a bit. In one timing
test I made, the time to process 10,000 structures using fragment_chiral()
without SanitizeMol() was 107 seconds, while the time with
SanitizeMol() was 255 seconds. The overall processing time takes about
twice as long. (There is an additional constant overhead to parse the
SMILES and canonicalize the fragmented molecule, which I did not
include.)
I don't think the performance impact is worth a slightly error rate,
so I will not include SanitizeMol() in this essay. That means I can
use my new fragment_on_mol() function as-is, and I'll modify
fragment_chiral() so it calls ClearComputedProps() instead of
SanitizeMol().
In any case, the explicit call to SanitizeMol() is a workaround. I
hope ClearComputedProps() will suffice in the RDKit of the future
world that you, the reader, inhabit.
Cross-validation
In the earlier essay I tested the fragment_chiral() function by
attempting to re-connect the fragments. If the canonicalized result
doesn't match the canonicalized input structure then there's a
problem. (And there were a small number of problems, at the edge of
RDKit's ability to handle chemisty.)
That test ignored what SMILES calls "directional bonds", that is the
"/" and "\" in "F/C=C\F", which is how SMILES denotes the
stereochemistry of double bonds. This is how SMILES deals with "E-Z
notation", which would be more familiar to a chemist. The tests
ignored those bonds for the simple reason that FragmentOnBonds()
doesn't preserve that information if it cuts the bond. In a future
essay I'll show how to implement that ability.
Instead of fully testing fragment_on_bonds(), I'll do a weaker test
where I compare its results to fragment_chiral(). I say it's weaker
because what it shows is that they are likely bug-compatible, which is
different than being correct. It's also weaker because the two
implementation methods are very similar. A stronger test would use a
different way to fragment.
The test code is not interesting, so I'll put it at the end and not
discuss it in detail. The results weren't interesting either. There
were no mismatches. I just had to wait for a few hours while I let it
process the 1.4M+ structures from ChEMBL 20.
The two implementation gave identical results, and time needed for
fragment_chiral() is about 1.7x that needed for
fragment_on_bond(). The fragment_on_bond() code is shorter, faster,
and easier to understand, so you should always use it instead of fragment_chiral().
The code
The fragment_chiral() function here is a copy of the previous essay,
except I replaced the SanitizeMol(new_mol) with a
mol.ClearComputedProps() for 2x performance, at the expense of perhaps
a few incorrect structures. I put it here so the test code was
self-contained, except of course you'll need your own dataset.
from __future__ import print_function
import time
import datetime
from rdkit import Chem
# Two different methods to cut a bond and fragment an RDKit molecule.
# - fragment_on_bond() uses RDKit's FragmentOnBonds()
# - fragment_chiral() uses lower-level API calls
# This also includes a cross-validation function which checks that the
# two methods produce the same fragments as output.
# The final two lines when I evaluated ChEMBL20 were:
#
# Processed 1455762 records, 1455763 molecules, 16017700 tests, 0 mismatches. T_chiral: 5912.26 T_on_bond 3554.22
# elapsed time: 8:30:15
#
# The timings show that fragment_on_bond() is about 1.7x the performance of fragment_chiral().
# I also used this to evalute the performance impact of SanitizeMol()
# After 10,000 molecules, the times were:
# with SanitizeMol(): T_chiral: 254.94 T_on_bond 210.82
# without SanitizeMol(): T_chiral: 107.22 T_on_bond 64.99
# That call adds a lot of overhead for little gain in accuracy.
#### Fragment using a high-level RDKit API function
def fragment_on_bond(mol, atom1, atom2):
bond = mol.GetBondBetweenAtoms(atom1, atom2)
new_mol = Chem.FragmentOnBonds(mol, [bond.GetIdx()], dummyLabels=[(0, 0)])
# After breaking bonds, should re-sanitize
# See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
# However, I didn't see much of an improvement, except for chiral
# carbon bridgeheads, and the result has 1/3 the performance
#Chem.SanitizeMol(new_mol)
return new_mol
#### Fragment using low-level RDKit API functions
# See http://www.dalkescientific.com/writings/diary/archive/2016/08/14/fragment_chiral_molecules.html
# for implementation discussion
CHI_TETRAHEDRAL_CW = Chem.ChiralType.CHI_TETRAHEDRAL_CW
CHI_TETRAHEDRAL_CCW = Chem.ChiralType.CHI_TETRAHEDRAL_CCW
def parity_shell(values):
# Simple Shell sort; while O(N^2), we only deal with at most 4 values
# See http://www.dalkescientific.com/writings/diary/archive/2016/08/15/fragment_parity_calculation.html
# for faster versions for fixed-size lists.
values = list(values)
N = len(values)
num_swaps = 0
for i in range(N-1):
for j in range(i+1, N):
if values[i] > values[j]:
values[i], values[j] = values[j], values[i]
num_swaps += 1
return num_swaps % 2
def get_bond_parity(mol, atom_id):
"""Compute the parity of the atom's bond permutation
Return None if it does not have tetrahedral chirality,
0 for even parity, or 1 for odd parity.
"""
atom_obj = mol.GetAtomWithIdx(atom_id)
# Return None unless it has tetrahedral chirality
chiral_tag = atom_obj.GetChiralTag()
if chiral_tag not in (CHI_TETRAHEDRAL_CW, CHI_TETRAHEDRAL_CCW):
return None
# Get the list of atom ids for the each atom it's bonded to.
other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
# Use those ids to determine the parity
return parity_shell(other_atom_ids)
def set_bond_parity(mol, atom_id, old_parity, old_other_atom_id, new_other_atom_id):
"""Compute the new bond parity and flip chirality if needed to match the old parity"""
atom_obj = mol.GetAtomWithIdx(atom_id)
# Get the list of atom ids for the each atom it's bonded to.
other_atom_ids = [bond.GetOtherAtomIdx(atom_id) for bond in atom_obj.GetBonds()]
# Replace id from the new wildcard atom with the id of the original atom
i = other_atom_ids.index(new_other_atom_id)
other_atom_ids[i] = old_other_atom_id
# Use those ids to determine the parity
new_parity = parity_shell(other_atom_ids)
if old_parity != new_parity:
# If the parity has changed, invert the chirality
atom_obj.InvertChirality()
def fragment_chiral(mol, atom1, atom2):
"""Cut the bond between atom1 and atom2 and replace with connections to wildcard atoms
Return the fragmented structure as a new molecule.
"""
rwmol = Chem.RWMol(mol)
atom1_parity = get_bond_parity(mol, atom1)
atom2_parity = get_bond_parity(mol, atom2)
rwmol.RemoveBond(atom1, atom2)
wildcard1 = rwmol.AddAtom(Chem.Atom(0))
wildcard2 = rwmol.AddAtom(Chem.Atom(0))
new_bond1 = rwmol.AddBond(atom1, wildcard1, Chem.BondType.SINGLE)
new_bond2 = rwmol.AddBond(atom2, wildcard2, Chem.BondType.SINGLE)
if atom1_parity is not None:
set_bond_parity(rwmol, atom1, atom1_parity, atom2, wildcard1)
if atom2_parity is not None:
set_bond_parity(rwmol, atom2, atom2_parity, atom1, wildcard2)
# After breaking bonds, should re-sanitize
# See https://github.com/rdkit/rdkit/issues/1022#issuecomment-239355482
# However, I didn't see much of an improvement, except for chiral
# carbon bridgeheads, and the result has 1/2 the performance.
new_mol = rwmol.GetMol()
# I must ClearComputedProps() after editing the structure.
new_mol.ClearComputedProps()
#Chem.SanitizeMol(new_mol)
return new_mol
####### Cross-validation code
def read_records(filename):
with open(filename) as infile:
for recno, line in enumerate(infile):
smiles, id = line.split()
if "*" in smiles:
continue
yield recno, id, smiles
def cross_validate():
# Match a single bond not in a ring
BOND_SMARTS = "[!#0;!#1]-!@[!#0;!#1]"
single_bond_pat = Chem.MolFromSmarts(BOND_SMARTS)
num_mols = num_tests = num_mismatches = 0
time_chiral = time_on_bond = 0.0
# Helper function to print status information. Since
# the function is defined inside of another function,
# it has access to variables in the outer function.
def print_status():
print("Processed {} records, {} molecules, {} tests, {} mismatches."
" T_chiral: {:.2f} T_on_bond {:.2f}"
.format(recno, num_mols, num_tests, num_mismatches,
time_chiral, time_on_bond))
filename = "/Users/dalke/databases/chembl_20_rdkit.smi"
start_time = datetime.datetime.now()
for recno, id, smiles in read_records(filename):
if recno % 1000 == 0:
print_status()
mol = Chem.MolFromSmiles(smiles)
if mol is None:
continue
num_mols += 1
# Find all of the single bonds
matches = mol.GetSubstructMatches(single_bond_pat)
for a1, a2 in matches:
# For each pair of atom indices, split on that bond
# and determine the canonicalized fragmented molecule.
# Do it for both fragment_chiral() ...
t1 = time.time()
mol_chiral = fragment_chiral(mol, a1, a2)
t2 = time.time()
time_chiral += t2-t1
smiles_chiral = Chem.MolToSmiles(mol_chiral, isomericSmiles=True)
# ... and for fragment_on_bond()
t1 = time.time()
mol_on_bond = fragment_on_bond(mol, a1, a2)
t2 = time.time()
time_on_bond += t2-t1
smiles_on_bond = Chem.MolToSmiles(mol_on_bond, isomericSmiles=True)
# Report any mismatches and keep on going.
num_tests += 1
if smiles_chiral != smiles_on_bond:
print("Mismatch: record: {} id: {} smiles: {} cut: {} {}".format(
recno, id, smiles, a1, a2))
print(" smiles_chiral:", smiles_chiral)
print("smiles_on_bond:", smiles_on_bond)
num_mismatches += 1
end_time = datetime.datetime.now()
elapsed_time = str(end_time - start_time)
elapsed_time = elapsed_time.partition(".")[0] # only display subseconds
print("Done.")
print_status()
print("elapsed time:", elapsed_time)
if __name__ == "__main__":
cross_validate()