This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: chemfp-1.0a1
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Any long-time reader knows that I'm interested in chemical
fingerprints. That link points to a 7-part series on how to
generate fingerprints, how to compute the Tanimoto score between two
fingerprints, and how to do that calculation quickly.
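As a refresher, the Tanimoto score between two binary fingerprints is the number of bits set in both divided by the number of bits set in either. Here's a minimal pure-Python sketch over packed byte strings; it's a toy illustration of the math, not chemfp's API:

```python
def popcount(fp: bytes) -> int:
    """Number of 1 bits in a packed binary fingerprint."""
    return sum(bin(b).count("1") for b in fp)

def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity: |A and B| / |A or B| over the set bits."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must be the same length")
    common = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = popcount(fp1) + popcount(fp2) - common
    return common / union if union else 0.0

# Two toy 8-bit fingerprints: 3 bits in common, 5 in the union -> 0.6
print(tanimoto(bytes([0b10110010]), bytes([0b10011010])))  # 0.6
```

The fast implementations discussed in that series do the same arithmetic with hardware popcount instructions rather than Python loops.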
If you already know what I'm doing then I'll jump to the punchline. My
chemfp
project has just released chemfp-1.0a1.tgz. It
supports:
- fingerprint extraction from SD files, including PubChem's substructure fingerprints
- native fingerprint generation in RDKit, OpenBabel, and OpenEye's OEChem
- cross-platform implementations of the RDKit MACCS keys and a variation of the PubChem substructure keys
- Tanimoto similarity search
It's an alpha release because the test suite isn't complete, and while
writing the documentation over the last day I've found a number of
corners where I just haven't tested the code paths. It's also an alpha
because I want feedback about its usefulness, and about what you think
I should do to make the format and the project more useful.
Background
Fingerprints are one of the core concepts in my field. It's hard to
read a few papers in my field without coming across something to do
with fingerprint generation or scoring. Yet for all of the
sophisticated understanding of the mathematical concept, the software
ecosystem around fingerprints is rather weak.
The RDKit, OpenBabel, OpenEye's OEChem, the CDK, Schrodinger's Canvas,
Xemistry's CACTVS, Accelrys's PipelinePilot, CCG's MOE, and many, many
more tools generate fingerprints. I'm limiting myself to those which
produce dense binary fingerprints of size less than about 10,000 bits,
and usually only a few hundred. The best known are the MACCS keys with
166 bits. The CACTVS keys in PubChem are 881 bits, and hash
fingerprints are usually around 1024-4096 bits.
So here's a question: how well does RDKit's MACCS key implementation
compare to OpenEye's? Which provides a better similarity score: the
CACTVS substructure keys or the slightly longer OpenBabel path
fingerprints?
Try answering those questions and you'll see that there is no exchange
format for fingerprints. OpenBabel, Canvas, and OEChem all have
black-box storage files for fast fingerprint search, but they are not
meant for external use. Instead, you'll likely end up making your own
format, and tools which work with that format, none of which are
portable outside of your own research group. (In one case a researcher
used a format with space-separated "0"s and "1"s, meaning 2 bytes to
store each bit!)
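To see why that text form is so wasteful, here's a hypothetical packer that turns such a space-separated bit string into packed bytes; the function name and bit layout (most significant bit first) are my own choices for illustration:

```python
def pack_bits(text: str) -> bytes:
    """Pack space-separated "0"/"1" values into bytes, most significant bit first."""
    bits = "".join(text.split())
    padded = bits.ljust(-(-len(bits) // 8) * 8, "0")  # zero-pad to a byte boundary
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

text = "1 0 1 1 0 0 1 0"       # 15 characters of text for 8 bits of data
packed = pack_bits(text)       # 1 byte: 0b10110010
print(len(text), len(packed))  # 15 1
```

At 2 bytes per bit, a 166-bit MACCS fingerprint takes about 332 bytes as text versus 21 bytes packed.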
I first ran into this problem years ago when a client had a fixed data
set and they wanted a web application to return the nearest 3
structures in the data set. They had the Daylight toolkit, which
supports the basic fingerprint functions but which doesn't have an
external file format. I had to roll my own, which doesn't take much
time, but it wasn't worth my client's time to add all the
optimizations that I know are available.
One of my interests is fast Tanimoto/similarity searching. There are a
number of papers on fancy ways to organize fingerprints, often with
hierarchical trees. I've tried implementing these only to find that
they were still slower than my fast linear search with the Swamidass
and Baldi cutoff optimization. One of the papers showed benchmarks of
their improved algorithm vs. the linear+Baldi optimization code, and
showed that theirs was still faster. I looked at their code and realized
that they had a horribly slow Tanimoto calculation. I think (without
strong evidence!) that these algorithms can't beat a fast linear
search, so I want to have a reference system for them to show me that
I'm wrong.
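For reference, the Swamidass and Baldi cutoff is a popcount bound: since Tanimoto(A, B) can never exceed min(|A|, |B|) / max(|A|, |B|), any target whose popcount falls outside [t*|A|, |A|/t] can be skipped without computing a score. A minimal sketch of the bound itself (the function name is mine, not from any toolkit):

```python
def popcount_bounds(query_popcount: int, threshold: float) -> tuple[float, float]:
    """Swamidass-Baldi bound: the popcount range a target must fall in
    to possibly reach the given Tanimoto threshold against the query.

    Tanimoto = c / (A + B - c) with c <= min(A, B), so the best possible
    score for popcounts A and B is min(A, B) / max(A, B).
    """
    return threshold * query_popcount, query_popcount / threshold

# A query with 32 bits set at threshold 0.8: only targets with
# between 25.6 and 40 bits set can possibly match.
low, high = popcount_bounds(32, 0.8)
print(low, high)  # 25.6 40.0
```

Sorting the targets by popcount turns that bound into a contiguous slice of the array, which is what makes the pruned linear scan so fast.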
Early in 2010 I proposed an exchange format. Two formats, actually,
one in text for easy generation and exchange and the other a binary
format for fast analysis. The binary format is on hold while I work on
the text format, in part because the text format is much more useful.
A format by itself might look pretty, but it isn't useful. My
experience is that functionality is the major reason to use a given
format, not prettiness. I wanted to provide a good set of initial
tools to encourage people to use the chemfp format. So, after about 40
days of work over the last 1.5 years, I present to you:
    close failed in file object destructor:
    Error in sys.excepthook:
    Original exception was:
(Sigh. Well, there's a reason this is listed as "alpha".)
Now that I have the queries and targets, I'll use simsearch
and do the default similarity search, which finds the nearest k=3
targets to the query according to the Tanimoto similarity.
This output format is still somewhat experimental. I'm looking for
feedback. You can see it's in the same "family" as the fps format,
with a line containing the format and version, followed by key/value
header lines, followed by data. I'm still not sure about what metadata
is needed here (do I really need the source and target filenames?
Should I also have the date?), so feedback is much appreciated.
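To make that "family" structure concrete, here's a sketch of reading such a stream: one format/version line, then "#key=value" header lines, then the data records. The header fields and sample record below are made up for illustration, and this is not a parser for the real specification:

```python
def read_fps_family(lines):
    """Split a chemfp-style stream into (version line, header dict, data lines)."""
    it = iter(lines)
    version = next(it).strip()          # e.g. "#FPS1"
    headers, data = {}, []
    for line in it:
        line = line.rstrip("\n")
        if not data and line.startswith("#") and "=" in line:
            key, _, value = line[1:].partition("=")
            headers[key] = value        # "#num_bits=166" -> {"num_bits": "166"}
        elif line:
            data.append(line)           # hex fingerprint, tab, identifier
    return version, headers, data

sample = ["#FPS1\n", "#num_bits=166\n", "#source=example.sdf\n",
          "07de9a\tID1\n"]
version, headers, data = read_fps_family(sample)
print(version, headers["num_bits"], len(data))  # #FPS1 166 1
```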
You can of course specify different values of -k and set a
--threshold. You can also search for just --counts, e.g., if you want
to find how many targets are within 0.4 of the query but you don't
actually care about the identifiers.
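All three modes (nearest-k, a score threshold, and a bare count) can be built on one linear Tanimoto scan. A toy sketch over packed byte strings, with my own helper names; this is not chemfp's implementation:

```python
import heapq

def tanimoto(fp1: bytes, fp2: bytes) -> float:
    """Tanimoto similarity between two equal-length packed fingerprints."""
    common = sum(bin(a & b).count("1") for a, b in zip(fp1, fp2))
    union = sum(bin(b).count("1") for b in fp1 + fp2) - common
    return common / union if union else 0.0

def knearest(query: bytes, targets, k: int = 3):
    """Identifiers of the k targets most similar to the query (the -k idea)."""
    scored = ((tanimoto(query, fp), name) for name, fp in targets)
    return [name for _, name in heapq.nlargest(k, scored)]

def count_within(query: bytes, targets, threshold: float) -> int:
    """How many targets score at or above the threshold (the --counts idea)."""
    return sum(1 for _, fp in targets if tanimoto(query, fp) >= threshold)

targets = [("A", bytes([0b11110000])),
           ("B", bytes([0b10110010])),
           ("C", bytes([0b00001111]))]
query = bytes([0b11110000])
print(knearest(query, targets, k=2))      # ['A', 'B']
print(count_within(query, targets, 0.5))  # 2
```

The real tool layers the Swamidass-Baldi popcount pruning and a C inner loop on top of the same scan.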
There's a lot more in the package. See rdkit2fps,
ob2fps,
and oe2fps
for examples of how to generate new fingerprints from the three supported toolkits.
Now, I admit, some of it's buggy, but all the examples in the wiki
documentation do work -- I skipped the ones that didn't!
I don't usually release "alpha" tools, but I'm really looking for
feedback. Kick the tires and let me know what you
think! (And if you want to fund me, all the better. I am a consultant,
you know. :) )