This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Dealing with SSSR
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
This delves into the depths of how different toolkits use the SSSR
(Smallest Set of Smallest Rings) algorithm. The end result is that I
will be removing support for bits 257-262 of the ChemFP Substruct key
because I cannot reliably implement "number of aromatic rings" and
"number of hetero-aromatic rings" across multiple toolkits.
Travel note: I will be at the 9thICCS conference,
leaving tomorrow. Hope to see some of my readers there!
ESSSR
Section 2 of the PubChem fingerprint definition starts:
Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings
(ESSSR) ring set - These bits test for the presence or count of the
described chemical ring system. An ESSSR ring is any ring which does
not share three consecutive atoms with any other ring in the chemical
structure. For example, naphthalene
has three ESSSR rings (two phenyl fragments and the 10-membered
envelope), while biphenyl
will yield a count of only two ESSSR rings.
Looking at this now, I realized something I missed the previous many
times I looked - naphthalene has only two cycles, which means any SSSR
algorithm should find two rings. (Otherwise it's not the smallest
set.) How come the documentation says there are three ESSSR
rings?
SSSR lovers and haters
The CDK implementation of the ESSSR dependent bits uses its SSSR
algorithm. For example, bit 115 is set for ">=1 any ring of size 3."
The corresponding CDK code is:
ringSet = finder.findSSSR();
...
public int countAnyRing(int size) {
int c = 0;
for (IAtomContainer ring : ringSet.atomContainers()) {
if (ring.getAtomCount() == size)
c++;
}
return c;
}
...
b = 115;
if (cr.countAnyRing(3) >= 1) fp[b >> 3] |= MASK[b % 8];
We believe that it is a great service to our customers that we do not
include any SSSR functionality in OEChem.
As a consequence, they changed the meaning of the SMARTS "R"
definition to indicate the number of ring bonds an atom has rather
than the number of SSSR rings that it's in. "[R1]" under Daylight is
roughly equivalent to "[R2]" under OpenEye.
Don't use SSSR when SMARTS is close enough
My goal for the ChemFP substructure keys is to develop a useful and
especially cross-platform substructure filter definition closely based
on the PubChem keys. They don't need to be identical to
PubChem. OpenEye doesn't have SSSR support so I decided it would be
easier to implement those ring definitions as SMARTS patterns. For bit
115 that's "*~1~*~*1". All of the fixed-sized rings can be
done this way.
SMARTS isn't good enough; counting aromatic rings
SMARTS like this are very portable, but there are limits to the types
of patterns that SMARTS can handle. Consider these bits:
The SMARTS for 255 and 226 are [aR] and [aR;#!6] but
there are no SMARTS patterns for the other. It must be done through
some other means.
Here's the CDK implementation for bit 257 (2 or more aromatic rings):
public int countAromaticRing() {
int c = 0;
for (IAtomContainer ring : ringSet.atomContainers()) {
if (isAromaticRing(ring)) c++;
}
return c;
}
...
b = 257;
if (cr.countAromaticRing() >= 2) fp[b >> 3] |= MASK[b % 8];
See how it uses the already computed SSSR?
I went ahead an implemented something like it for OpenBabel, RDKit,
and Indigo. Here's how to do it:
def openbabel_count_aromatic_rings(mol):
count = 0
for ring in mol.GetSSSR():
# Note: the OB implementation is wrong. It assumes that if all
# atoms in the ring are aromatic then the ring itself must be
# aromatic. This is not necessarily true.
if ring.IsAromatic():
count += 1
return count
def rdkit_count_aromatic_rings(mol):
count = 0
for ring in mol.GetRingInfo().BondRings():
if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring):
count += 1
return count
def indigo_count_aromatic_rings(mol):
count = 0
mol.aromatize() # XXX missing from chemfp!
for ring in mol.iterateSSSR():
# bond-order == 4 means "aromatic"; all rings bonds must be aromatic
if all(bond.bondOrder() == 4 for bond in ring.iterateBonds()):
count += 1
return count
(And yes, I know I could have done "count += all(....)".)
Cross-toolkit comparison of the aromatic ring counts
How toolkit-dependent is this definition? I've got a bunch of PubChem
data. Here's a driver script using the molecule readers in chemfp. (Do
note that the API will change. I'm going to swap the order of
molecule and title in the reader iterator.)
Indigo noticibly disagrees with the consensus. If we just consider
OpenBabel and RDKit then there's an toolkit bias of about 1.5-2%.
How many aromatic rings are obvious even without SSSR?
However! How many of these are trivial? That is, most of these are
likely independent rings, which of course won't have a problem. What
we need to do is compare the number of toolkit differences to the
number of times where there could be a difference.
While OpenEye doesn't have SSSR, they do have a "count number of
aromatic ring systems" function, and we can use that to compare the
number of aromatic rings found by RDKit to the number of aromatic ring
systems found by OpenEye.
from itertools import izip
from chemfp import openeye, rdkit
from openeye.oechem import (OEDetermineAromaticRingSystems, OECreateCanSmiString,
OEThrow, oeosstream)
# Disable printing OpenEye's warnings
oes = oeosstream()
OEThrow.SetOutputStream(oes)
def rdkit_count_aromatic_rings(mol):
count = 0
for ring in mol.GetRingInfo().BondRings():
if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring):
count += 1
return count
def openeye_min_aromatic_rings(mol):
num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol)
return num_aromatic_systems
def ring_compare(filename):
num_zero = num_lt = num_eq = num_gt = 0
for ((oe_title, oe_mol), (rd_title, rd_mol)) in izip(
openeye.read_structures(filename),
rdkit.read_structures(filename)):
assert oe_title == rd_title
num_oe = openeye_min_aromatic_rings(oe_mol)
num_rd = rdkit_count_aromatic_rings(rd_mol)
if num_oe == num_rd == 0:
num_zero += 1
elif num_oe == num_rd:
num_eq += 1
elif num_oe < num_rd:
num_gt += 1
else:
num_lt += 1
print "LESS?!", oe_title, OECreateCanSmiString(oe_mol)
print filename, "lt:", num_lt, "equal:", num_eq, "greater:", num_gt
for filename in (
"/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz",
"/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz",
"/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz",
"/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz",
"/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz",
"/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz"):
ring_compare(filename)
In theory RDKit will always find at least the number of aromatic ring
systems which OEChem finds, but they have different definitions of
what "aromaticity" means, so there are a few cases otherwise.
In other words, only 4190+5793+5485+1880+5486+4034 = 26868 compounds
can have different aromatic ring counts because 102488 of the
compounds have easily identified rings.
Now go back to the previous numbers. There were 2,068 times where
OpenBabel did not agree with the consensus. In other words, 8% of the
times where OpenBabel could differ from RDKit, were
different. For Indigo it's fully 50% of the time,
The hetero-aromatics are a bit harder to calculate. I need to find
aromatic rings which also contain a non-carbon. Here are the
implementations for OpenBabel, RDkit and Indigo.
def openbabel_count_hetero_rings(mol):
count = 0
for ring in mol.GetSSSR():
if (ring.IsAromatic() and
any(mol.GetAtom(atom_id).GetAtomicNum() != 6 for atom_id in ring._path)):
count += 1
return count
The hardest was RDKit, which doesn't have a "ring" as an independent
concept. It has a RingInfo, but that gives you all the atoms in all
the rings, or all the bonds in all the rings, as lists of lists of
either atom or bond indicies. These are parallel lists, so I can
extract the atoms and bonds which correspond to a given ring by doing
a zip.
def rdkit_count_hetero_rings(mol):
count = 0
ring_info = mol.GetRingInfo()
ring_atom_info = ring_info.AtomRings()
ring_bond_info = ring_info.BondRings()
for (ring_atoms, ring_bonds) in zip(ring_atom_info, ring_bond_info):
if not all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring_bonds):
continue
if any(mol.GetAtomWithIdx(atomIdx).GetAtomicNum() != 6 for atomIdx in ring_atoms):
count += 1
return count
And lastly, Indigo:
def indigo_count_hetero_rings(mol):
count = 0
for ring in mol.iterateSSSR():
# bond-order == 4 means "aromatic"; all rings bonds must be aromatic
if not all(bond.bondOrder() == 4 for bond in ring.iterateBonds()):
continue
if any(atom.atomicNumber() != 6 for atom in ring.iterateAtoms()):
count += 1
return count
With that in place the change to the consensus SSSR search program is
simply:
Looking at the totals, only 67030 of the 129405 structure had an
hetero-aromatic ring, and there were 2082 times when the OpenBabel did
not agree with the consensus, by which I'll interpret to mean that
there's a 3% difference in the counts between RDKit and OpenBabel.
Again you see that RDKit was closest to the consensus. Here's some
cases where all three differed. I have not analyzed why:
How many hetero-aromatic rings are obvious even without SSSR?
Again, to judge how much of a problem this is we need to figure out
how many of these hetero-aromatic ring counts were obvious, that is,
where the SSSR implementation is not relevant.
OpenEye's OEDetermineAromaticRingSystems() returns two components. The
first is the total number of system, the second is a lookup table from
the atom index to the component number that it's in. (Component 0 is
reserved for those atoms not in a ring system.)
If I iterate over the non-carbon aromatic atoms, and get the component
numbers for each one, and turn them into a set, then the size of the
set is the number of ring systems with at least one hetero-aromatic
atom. That's confusing to read, and I'm not sure the code is any
cleaner to someone who hasn't used the toolkit before, but here it is:
def openeye_min_hetero_rings(mol):
num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol)
# Atoms which match [a;!#6]
hetero_atoms = mol.GetAtoms(OEAndAtom(OEIsAromaticAtom(),
OENotAtom(OEIsCarbon())))
atom_components = set(parts[atom.GetIdx()] for atom in hetero_atoms)
return len(atom_components)
and the change to the OEChem/RDKit comparison code is:
which works out to only 6482 structures which can have a difference.
Toolkit hetero-aromatic ring counts vs. complex hetero-aromatic ring systems
Since there were 2082 cases where OpenBabel gave a different answer
than RDKit, that means that about 1/3 of the times where OpenBabel
could differ from RDKit, were different.
Unfortunately, it's not as simple as that. Indigo reports 11,000
differences from the consensus, and 11,000 > 6,482. Why is this?
Almost certainly because of differences in aromaticity or differences
in SSSR determination. If multiple SSSRs are possible, then it's
possible that one choice of SSSRs ends up with one hetero-aromatic
ring, while another choice gives two such rings.
Someone could go through and figure out why they are different. That
person isn't me. Perhaps it's you?
Why do the existing toolkits use SSSR? I don't know. History? Tradition?
What do people use SSSR for in the existing toolkits? Again, I don't
know. Have they evalauted the stability of the results due to atom
input order?
What is ESSSR? That's the algorithm PubChem says they use for those
bits. The Berger paper doesn't mention ESSSR but it does mention
"Extended Set of Smallest Rings (ESSR)" and says it:
... was introduced by Downs et al. [17] as an approach to design an
optimal ring set for retrieval purposes. ESSR by definition is
limited to planar graphs.
Perhaps that is the ESSR used at PubChem, but not only are some
PubChem structure non-planar, but the Berger paper shows examples
where ESSR is not correct. It does say that an ESSR variation "ESSR'
... is indeed a useful definition for a chemically relevant 'extended
set of smallest rings' that could be generated reasonably
efficiently..."
No (E)SSSR will be used in the ChemFP substructure fingerprint
I am so not going to implement that alogorithm for each toolkit, for
the sole purposes of computing a couple of extra bits. It's not
portable and it appears to be highly variable. I would need a lot of
testing to convince myself that something as simple as the atom order
doesn't affect the hetero-atom ring counts.
Therefore, I'm going to leave those bits as 0 in ChemFP's
implementation, unless I can find something appropriate.
Comments?
Do you use SSSR? How do you handle these problems? Do you have a
better suggestion for me than to ignore those bits?