Python Buzz Forum - Molecular fingerprints, background

Articles |
News |
Weblogs |
Books |
Forums

Artima Forums | Articles | Weblogs | Java Answers | News

Sponsored Link •

Python Buzz Forum
Molecular fingerprints, background

0 replies on 1 page.

Welcome Guest
Sign In

Back to Topic List

Reply to this Topic

Search Forum

Threaded View


Previous Topic		Next Topic

Flat View: This topic has 0 replies on 1 page

Andrew Dalke

Posts: 291
Nickname: dalke
Registered: Sep, 2003

Andrew Dalke is a consultant and software developer in computational chemistry and biology.

Molecular fingerprints, background

Posted: Jun 26, 2008 6:11 AM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Molecular fingerprints, background Feed Title: Andrew Dalke's writings Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.	Latest Python Buzz Posts Latest Python Buzz Posts by Andrew Dalke Latest Posts From Andrew Dalke's writings

A molecular "fingerprint" is a widely used concept in chemical informatics. Comparing molecules is hard. Comparing bitstrings is easy. Many people turn a molecule into a bitstring under the assumption that comparing the bitstring gives insight to how to compare molecules. To make life more complicated, chemical informatics uses the term "fingerprint" but feel free to think of it in your head as "bitstring."

There are two main major types of fingerprints, and lots of lesser used ones. "Structural fingerprints" are based on substructure features. The most widely known and used are the MACCS keys, which are available in two flavors: 166 bit and 320 bit. There's also the CACTVS substructure keys, which PubChem uses.

Each bit in a structural fingerprint corresponds to some chemical property, usually some sort of substructure presence. These are designed to be chemically relevant, which usually means it's good for the sorts of chemistry that the designer was interested in. It was easier to find documentation about the PUBCHEM_CACTVS_SUBSKEYS in PubChem so I'll share a bit of that definition.

Section 1: Hierarchic Element Counts - These bits test for the 
presence or count of individual chemical atoms represented 
by their atomic symbol.

Bit Position	Bit Substructure
0	>= 4 H
1 	>= 8 H
2	>= 16 H
3	>= 32 H
4	>= 1 Li
5	>= 2 Li
6	>= 1 B
7	>= 2 B
8	>= 4 B
 ...
Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set
 ...
115	>= 1 any ring size 3
116	>= 1 saturated or aromatic carbon-only ring size 3
117	>= 1 saturated or aromatic nitrogen-containing ring size 3
118	>= 1 saturated or aromatic heteroatom-containing ring size 3
119	>= 1 unsaturated non-aromatic carbon-only ring size 3
...
Section 7: Complex SMARTS patterns
 ...
713	Cc1ccc(C)cc1
714	Cc1ccc(O)cc1
 ...

Substructure keys are often used in substructure searches as a way to filter out obvious mismatches before doing a computationally expensive subgraph isomorphism test. They are sometimes used for similarity searches but in practice I think hash fingerprints are better because some bits are more likely to be set than others (how many compounds do you deal with that have 4 or more borons in them?).

The other major fingerprint type is called "hash fingerprints." The most well known of these are the Daylight fingerprints. Enumerate all linear substructures of length N in the molecule. Daylight sets the bounds on N as 3 <= N <= 7, but this is tunable. There are N atoms and N-1 bonds in this length. Each atom and bond has a numeric value, based on the element type, bond type, and other properties deemed interesting. Use some sort of hash function to turn this 2*N-1 valued term into an integer. Usually the last step of the process is to take that integer modulo the number of bits in the fingerprint so that it fits in the space available. Mark that bit with a 1. Many different paths might cause the same bit to be set to 1.

Hash fingerprints are always a multiple of 8 bits long and usually a power of 2 between 128 bits and 4096 bits. (Even OpenBabel, which uses a 1021 bit hash, maps that to a 1024 bit fingerprint, with 0 for the remaining 3 bits.) They are very similar to Bloom filters. Hash fingerprints are used most often in similarity searching, with the hope that two similar compounds create similar fingerprints, and that two similiar fingerprints means the compounds are similar.

Supposing that's true, how do you compare two fingerprints FP1 and FP2? There are many ways. One is to count the number of shared bits, that it, the number of bits which are on in both bitstrings. This is a good start, but consider two large quite dissimilar molecules which both have a section in common but otherwise are quite dissimilar. Intuitively you want the dissimilar structure to reduce the overall similarity score.

There's a huge number of ways to compare hash fingerprints, but each depends on only 4 different values:

a = the number of "on" bits in FP1 but not in FP2
b = the number of "on" bits in FP2 but not in FP1
c = the number of "on" bits in FP1 which are also "on" in FP1
d = the number of "off" bits in FP1 which are also "off" in FP1

When implementing the comparison routines it's easier to use slightly different definitions:

n = the total number of bits in A or B = a+b+c+d
A = the nuber of "on" bits in FP1 = a+c
B = the nuber of "on" bits in FP2 = b+c

I'm using the Daylight fingerprint notation here. It's a bit annoying because in my code I'll really want to use lower case "a" and "b" for what they use as "A" and "B". When I get to the code I'm likely going to switch.

The common term by those in the know for the "total number of 'on' bits" is "popcount", which is short for "population count." The collective wisdom of Wikipedia prefers Hamming weight. Which reminds me, take a break and enjoy Hamming's seminar "You and Your Research".

Because the bitstrings are bits, the way to compute "c" (the number of bits in common) is to compute the boolean "and" (fp1 & fp2) and find the popcount of the result.

The most widely used and accepted finger similarity function in chemical informatics is called the Tanimoto coefficient or Tanimoto score, or simply "the Tanimoto." It's identical to the Jaccard index as applied to binary bitstrings, and is defined as:

  c/(a+b+c)

which is the same as

  c/(A+B-c)

You can see that identical fingerprints have a score of 1.0, and when there are no bits in common the score is 0.0. Also note how if there are more bits in only one or the other fingerprint then the score descreases. If there are no bits in either fingerprint then this generates a "0/0" case. It means that the fingerprint generation code couldn't handle the chemistry involved. For hash fingerprints this should never happen, except perhaps with the vacuum case. For some sorts of substructure keys it might happen. It's like trying to compare a shark and Mt. Everest, based on how similar their wheels are.

By the way, there's another way to define the Tanimoto coefficient of two fingerprints. It's the number of on bits in the intersection of the two fingerprints, divided by the number of on bits in their union. I learned that one reading the OpenBabel source code.

Read: Molecular fingerprints, background

Previous Topic

Next Topic


	Web Artima.com