This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Molecular fingerprints, background
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
A molecular "fingerprint"
is a widely used concept in chemical informatics. Comparing molecules
is hard. Comparing bitstrings is easy. Many people turn a molecule
into a bitstring under the assumption that comparing the bitstring
gives insight to how to compare molecules. To make life more
complicated, chemical informatics uses the term "fingerprint" but feel
free to think of it in your head as "bitstring."
There are two main major types of fingerprints, and lots of lesser
used ones. "Structural fingerprints" are based on substructure
features. The most widely known and used are the MACCS keys, which
are available in two flavors: 166 bit and 320 bit. There's also the
CACTVS substructure keys, which PubChem uses.
Each bit in a structural fingerprint corresponds to some chemical
property, usually some sort of substructure presence. These are
designed to be chemically relevant, which usually means it's good for
the sorts of chemistry that the designer was interested in. It was
easier to find documentation about the PUBCHEM_CACTVS_SUBSKEYS
in PubChem so I'll share a bit of that definition.
Section 1: Hierarchic Element Counts - These bits test for the
presence or count of individual chemical atoms represented
by their atomic symbol.
Bit Position Bit Substructure
0 >= 4 H
1 >= 8 H
2 >= 16 H
3 >= 32 H
4 >= 1 Li
5 >= 2 Li
6 >= 1 B
7 >= 2 B
8 >= 4 B
...
Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set
...
115 >= 1 any ring size 3
116 >= 1 saturated or aromatic carbon-only ring size 3
117 >= 1 saturated or aromatic nitrogen-containing ring size 3
118 >= 1 saturated or aromatic heteroatom-containing ring size 3
119 >= 1 unsaturated non-aromatic carbon-only ring size 3
...
Section 7: Complex SMARTS patterns
...
713 Cc1ccc(C)cc1
714 Cc1ccc(O)cc1
...
Substructure keys are often used in substructure searches as a way to
filter out obvious mismatches before doing a computationally expensive
subgraph isomorphism test. They are sometimes used for similarity
searches but in practice I think hash fingerprints are better because
some bits are more likely to be set than others (how many compounds do
you deal with that have 4 or more borons in them?).
The other major fingerprint type is called "hash fingerprints." The
most well known of these are the Daylight fingerprints. Enumerate all
linear substructures of length N in the molecule. Daylight sets the
bounds on N as 3 <= N <= 7, but this is tunable. There are N atoms
and N-1 bonds in this length. Each atom and bond has a numeric value,
based on the element type, bond type, and other properties deemed
interesting. Use some sort of hash function to
turn this 2*N-1 valued term into an integer. Usually the last step of
the process is to take that integer modulo the number of bits in the
fingerprint so that it fits in the space available. Mark that bit
with a 1. Many different paths might cause the same bit to be set to
1.
Hash fingerprints are always a multiple of 8 bits long and usually a
power of 2 between 128 bits and 4096 bits. (Even OpenBabel, which
uses a 1021 bit hash, maps that to a 1024 bit fingerprint, with 0 for
the remaining 3 bits.) They are very similar to Bloom filters.
Hash fingerprints are used most often in similarity searching, with
the hope that two similar compounds create similar fingerprints, and
that two similiar fingerprints means the compounds are similar.
Supposing that's true, how do you compare two fingerprints FP1 and
FP2? There are many ways. One is to count the number of shared bits,
that it, the number of bits which are on in both bitstrings. This is
a good start, but consider two large quite dissimilar molecules which
both have a section in common but otherwise are quite dissimilar.
Intuitively you want the dissimilar structure to reduce the overall
similarity score.
There's a huge number of ways to compare hash fingerprints, but each
depends on only 4 different values:
a = the number of "on" bits in FP1 but not in FP2
b = the number of "on" bits in FP2 but not in FP1
c = the number of "on" bits in FP1 which are also "on" in FP1
d = the number of "off" bits in FP1 which are also "off" in FP1
When implementing the comparison routines it's easier to use slightly
different definitions:
n = the total number of bits in A or B = a+b+c+d
A = the nuber of "on" bits in FP1 = a+c
B = the nuber of "on" bits in FP2 = b+c
I'm using the Daylight
fingerprint notation here. It's a bit annoying because in my code
I'll really want to use lower case "a" and "b" for what they use as
"A" and "B". When I get to the code I'm likely going to switch.
The common term by those in the know for the "total number of 'on'
bits" is "popcount", which is short for "population count." The
collective wisdom of Wikipedia prefers Hamming weight.
Which reminds me, take a break and enjoy Hamming's
seminar "You and Your Research".
Because the bitstrings are bits, the way to compute "c" (the number of
bits in common) is to compute the boolean "and" (fp1 & fp2) and
find the popcount of the result.
The most widely used and accepted finger similarity function in
chemical informatics is called the Tanimoto coefficient or Tanimoto
score, or simply "the Tanimoto." It's identical to the Jaccard index
as applied to binary bitstrings, and is defined as:
c/(a+b+c)
which is the same as
c/(A+B-c)
You can see that identical fingerprints have a score of 1.0, and when
there are no bits in common the score is 0.0. Also note how if there
are more bits in only one or the other fingerprint then the score
descreases. If there are no bits in either fingerprint then this
generates a "0/0" case. It means that the fingerprint generation code
couldn't handle the chemistry involved. For hash fingerprints this
should never happen, except perhaps with the vacuum case. For some
sorts of substructure keys it might happen. It's like trying to
compare a shark and Mt. Everest, based on how similar their wheels
are.
By the way, there's another way to define the Tanimoto coefficient of
two fingerprints. It's the number of on bits in the intersection of
the two fingerprints, divided by the number of on bits in their union.
I learned that one reading the OpenBabel source code.