This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Generating molecular fingerprints with OpenBabel
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
This is part 2 of an ongoing series dealing with molecular fingerprints. Part 1.
I am not yet going to discuss implementation details of how to
generate molecular fingerprints. I'll assume you have access to a
program which generates them for you. This might be a commercial
program like from Daylight or from
Mesa Analytics.
For those following along from home (or who don't have a license for
those, or want to dig into the step-by-step details of how to handle
one of those programs), install OpenBabel.
It implements three different
fingerprint types:
FP2 (default) - linear fragments of length 1 to 7 (with some exceptions) using a hash code generating bits 0≤bit#<1021
FP3 - functional groups based on 55 SMARTS patterns (see "patterns.txt")
FP4 - functional groups based on 307 SMARTS patterns (see "SMARTS_InteLigand.txt")
Why are there two different file formats?
The FP3 and FP4 fingerprints are substructure fingerprints based on
SMARTS patterns listed in text files. Sadly, while I think the two
files should be in the same format, they aren't. The first looks
like:
In looking through the source I found this comment in the method
"ReadPatternFile" in the "fingerprints/finger3.cpp".
// Christian Laggner's format: SMARTS at end of line
This is the code which parses the FP3 and FP4 formats. (Yes, even
though it's named 'finger3.cpp' it also parses the FP4 format.)
Specifically, this comment comes from the section which figures out the
specific file format. My best guess is that the FP3 format existed
first, then someone decided to add the FP4 definitions using SMARTS
patterns that someone else defined.
There's a tradeoff here: should code be changed to fit data or data
changed to fit code? How malleable is code, and how much should you
change someone else's data? In this case the implementer decided the
easiest solution was to keep the original format unchanged; perhaps
because it would be easier to import updates in the future. I would
have reformatted the file to fit the expected format, perhaps keeping
the conversion program around to handle any updates.
Is it better to implement more features but haphazardly, or fewer
features but more cleanly? I'm biased much further towards the second
than most people seem to be, and I think that the combined solution of
"more features and cleanly" is generally possible. Perhaps
scientific software is worse than average because most people's
explicit goal is to get some scientific project done, and not to
support the needs of unknown others in the future. Or because most
scientists aren't trained programmers? Or perhaps I have an
unrealistic expectation of how green the grass might be on the other
side of the fence.
Generating fingerprints with Babel
Babel generates two types of fingerprint formats. One is an index
file used to store precomputed fingerprints when doing a substructure
search. It's a binary file format that's a bit too complex to talk
about here. The other is, I'm told, more of a debugging aid that
saves the fingerprints as hexadecimal strings. It's much easier to
show and understand so I'll use it.
I happen to have a small set of pharmacologically popular compounds
that I got from JJ some years back. It was for an example data set he
put together while at Daylight.
I'll convert that to the text fingerprint format ".fpt" file using
FP2-style fingerprints. Here I'll specify "-xfFP2" but that's
optional and if I left it out I would still get the FP2 fingerprints.
The "-xh" option says to include the hex fingerprints in the output
file:
The format isn't described anywhere but it's easy to reverse engineer.
Each input compound has an output record. The first compound is
special. I'll call it the query compound and the result the query
report. The query report lists the compound name, the number of bits
set in it, and the fingerprint itself. I'll call the others the
target compounds and the target reports. Every target report lists
the compound name, the Tanimoto score between the target fingerprint
and the query fingerprint, and the target fingerprint.
If you experiment with this some more (or read the documentation)
you'll occasionally see target reports which look like:
The "Possible superstructure" line means that every on bit in the
query fingerprint is also on in the target fingerprint. This can be
useful when doing substructure searches. A subgraph search is slower
than comparing bits in a fingerprint. With the right type of
fingerprint (like the CACTVS substructure keys; note that not all
fingerprints work for this) then if a bit is set in the query but not
in the target fingerprint then there's no way that the target can be a
superset of the query, and there's no need to do the more complicated
subgraph isomorphism search.
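That screening test is easy to express in code. Here's a minimal Python sketch, treating fingerprints as plain integers; the 8-bit values are made up for illustration:

```python
def possible_superstructure(query_fp, target_fp):
    """True if every bit set in query_fp is also set in target_fp."""
    return query_fp & target_fp == query_fp

# Hypothetical 8-bit fingerprints, for illustration only
query  = 0b00101000
target = 0b10101100   # has all of the query's bits, plus extras
other  = 0b10000100   # is missing one of the query's bits

print(possible_superstructure(query, target))   # True
print(possible_superstructure(query, other))    # False
```

When the test fails, the target cannot contain the query, so the slower subgraph isomorphism search can be skipped entirely.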
Back to the output. I was a bit surprised to see how much extra work
was being done. I only expected the fingerprints and not the Tanimoto
nor the possible superstructure test. These aren't hard to do,
especially given the input parsing and molecular perception that had
to be done, but it wasn't something I expected.
Probably there's history which explains why this happened. My theory
is that the code started as a quick and easy way to do single compound
Tanimoto searches of a structure file, without making an index or
other intermediate file. Over time someone wanted to get the actual
fingerprint values and figured the easiest solution was to add an
option to include that in the output.
The output for each record contains a fingerprint written as 6 lines
of 5 groups, plus 2 more groups. A group has 8 hex
values and each hex value is 4 bits, so there are (6*5+2)*8*4 = 1024
bits in each fingerprint. The documentation (and the code, in
fingerprints/finger2.cpp:CalcHash) says there are only 1021 bits set.
Therefore my awesome math skills tell me there must be 3 bits in the
output representation which are always 0. Which 3? How are the
abstract fingerprints represented in memory? How are they represented
in the output?
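The arithmetic is worth a quick sanity check:

```python
# 6 full lines of 5 groups, plus 2 extra groups; each group is
# 8 hex digits and each hex digit encodes 4 bits.
num_groups = 6 * 5 + 2
total_bits = num_groups * 8 * 4
print(total_bits)          # 1024
print(total_bits - 1021)   # 3 bits that can never be set
```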
Endianness
Most people these days deal with Intel processors and don't have to
worry about different endian machines.
It's something you should at least know about in case you want to
write portable software or to understand the chain of operations from
abstract fingerprint bit set to actual bit set in hardware.
Endianness can refer to bit endianness and to byte endianness. Yes,
it's complicated, but important enough that I'll go into some details.
First, bit endianness. How do you think about fingerprints? Most
likely your model is of the bits all in a row. Is the first bit (bit
0) on the left of the row or the right? I think of fingerprints as an
array, with 0 being the left-most digit. This is a little-endian
view, and in a 64 bit fingerprint it looks like:
while if you think of fingerprints as a huge number in base-2, with
the 1s digit on the right, then the 10s digit, then the 100s,
etc. then you are a big-endian person.
Both are equally valid, which is why historically there were such
religious wars over which was best.
How does OpenBabel represent the fingerprint? There are two forms,
one for in-memory use and the other used when writing this text format. (The
index file probably has its own format, but I haven't looked into that
yet.) Here's the relevant OpenBabel code:
The data is stored in a "vector<unsigned int> vec". Bit 0 is in
vec[0]:0, bit 1 is in vec[0]:1, bit 32 is in vec[1]:0, and so on.
Arrays are laid out in little endian order, so the 32 bits of vec[0]
come before the 32 bits of vec[1], which come before those of vec[2].
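In Python, the same word/bit indexing can be sketched with a list of 32-bit integers. This is my reconstruction of the layout, not OpenBabel's actual code:

```python
WORD_BITS = 32   # matches a 32-bit unsigned int

def set_bit(vec, n):
    """Set abstract bit n: it lives in word n // 32, at bit n % 32."""
    vec[n // WORD_BITS] |= 1 << (n % WORD_BITS)

def get_bit(vec, n):
    """Read abstract bit n back out of the word array."""
    return (vec[n // WORD_BITS] >> (n % WORD_BITS)) & 1

vec = [0] * 32        # 32 words * 32 bits = a 1024-bit fingerprint
set_bit(vec, 0)       # goes into vec[0], lowest bit
set_bit(vec, 33)      # goes into vec[1], bit 1
print(vec[0], vec[1])     # 1 2
print(get_bit(vec, 33))   # 1
```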
Here's where byte endianness comes into play. Intel CPUs use
little-endian byte order in a word, although in each byte the bits are
in big endian order. (I think.) That is, in a 32 bit integer the
actual bit order is:
Little endian is sometimes referred to as "1234" order because the
smallest byte goes first, then the 2nd smallest, then the 3rd,
then the 4th. The equivalent notation for big endian is "4321".
Does my machine do what I think it does?
I said that my machine is little-endian and that it arranges data in a
certain way. As Reagan would say, "trust, but
verify." How do I verify that my machine does what I think it
does? The easiest way is to write a block of memory to disk then look
at the bytes. I'm a Python programmer but I'm going to do this in C
because then I'm more certain I know exactly what the program is doing.
The "od" program (short for "octal dump") has various options to
describe how to convert the bytes into something readable. In this
case I'm displaying each character as a hex value, where "20" is the
hex value for the binary value "00100000".
The "00 00 00 00" is the value 0. The
"01 00 00 00" is the 1, and you can see that the 1 went
into the first byte, as expected on a little endian machine. The
remainder of the test cases show that all bytes are in the expected
order.
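The same sort of check can be done from Python with the struct module, which makes the byte-order choices explicit. This is a separate sketch, not the C program described above:

```python
import struct

# Pack the integer 1 three ways: native order, explicit little-endian,
# and explicit big-endian, then compare the raw bytes.
native = struct.pack("=I", 1)
little = struct.pack("<I", 1)
big    = struct.pack(">I", 1)

print(little.hex())   # 01000000 - the 1 goes into the first byte
print(big.hex())      # 00000001 - the 1 goes into the last byte
print("little-endian machine" if native == little else "big-endian machine")
```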
How OpenBabel saves in-memory fingerprints to disk
I've figured out the in-memory representation, but that's not the same
as the fingerprint format babel generates with the "-xh" option. I
could reverse engineer it, because there are only a few sane ways to
write a fingerprint, but since I've got the code I might as well use
it. The conversion is done in formats/fingerprintformat.cpp in
WriteMolecule:
This appears to scramble the order up even more, but actually converts
the entire fingerprint into big-endian format at the bit level! I
said that arrays are laid out in little endian order, with vec[0]
first. The for-loop here goes in reverse order, so the output will
be in big endian order at the word level.
Each unsigned int is then converted with C++ stream formatting into a
hex field at least 8 characters wide (note that the format would
change on a system where unsigned int takes 64 bits). The hex
representation of an unsigned int is bitwise big endian, so the entire
fingerprint is in big endian order.
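A Python sketch of that serialization: walk the word array in reverse and format each 32-bit word as 8 hex digits. This imitates the effect of the C++ loop under my reading of it, rather than translating the actual code:

```python
def fingerprint_to_hex(vec):
    """Write the words in reverse order, 8 hex digits each, giving a
    bitwise big-endian rendering of the whole fingerprint."""
    return " ".join("%08x" % word for word in reversed(vec))

# A tiny 64-bit example: bit 0 (word 0) and bit 33 (word 1, bit 1) set
print(fingerprint_to_hex([0x00000001, 0x00000002]))   # 00000002 00000001
```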
Perhaps this helps a bit. Given a 64 bit fingerprint, here are some
different representations:
== fingerprint index (the way I think of them) ==
00000000 00111111 11112222 22222233 33333333 44444444 44555555 55556666
01234567 89012345 67890123 45678901 23456789 01234567 89012345 67890123
== bits (in memory, Intel little-endian order) ==
word 0 word 1
byte 0 byte 1 byte 2 byte 3 byte 4 byte 5 byte 6 byte 7
[00000000 11111100 22221111 33222222][33333333 44444444 55555544 66665555]
[76543210 54321098 32109876 10987654][98765432 76543210 54321098 32109876]
== bits (on disk, big endian order in hex) ==
byte 0 byte 1 byte 2 byte 3 byte 4 byte 5 byte 6 byte 7
66665555 55555444 44444444 33333333 33222222 22221111 11111100 00000000
32109876 54321098 76543210 98765432 10987654 32109876 54321098 76543210
I'm not sure that helped, but perhaps it did. Notice how the end
result has the bits in reverse order from what they are in abstract
space? Isn't that cool? I wonder if it was designed to be that way.
Does OpenBabel do what I think it does?
Again, how can I verify that the on-disk fingerprint representation is
what I think it is? I generated a bunch of FP2 fingerprints earlier,
and I know 3 bits are never set. By looking at the code, which does a
"%1021" (modulo 1021), those are going to be bits 1021, 1022, and
1023, which means the first byte of the hex fingerprint can only be 0
or 1, because the first three bits will always be 0. Inspecting the
generated fingerprints, nicotine is the only compound which has a 1
there, and all others have a 0. I might be right, but 10 compounds
aren't enough to be confident that the observation is meaningful.
For example, note that the last hex digit in 10 cases was 0, which is
compatible with a little-endian output format.
I need more data. I've got a subset of the venerable NCI data set
hanging around as an SD file so I converted that into "nci_subset.fpt"
and wrote a program to extract all of the hex fingerprints from it.
That was really easy. Python supports arbitrary length hex integers,
in big-endian order, so when I get a fingerprint like
I'll accumulate all of these values by doing a successive bitwise-'or'
of each number. This works because all numbers are internally
represented in binary. By doing a bitwise-or a given bit will be set
if any of the input fingerprints was set in that position.
Before doing that, another example might help. First, some moral
support. If you're jumping into this from chemistry, with no
experience in binary representation or byte order, then you're having
to learn a different way of thinking about numbers. I hope this level
of detail helps by pointing out how things work step-by-step. Those
with a strong CS background have probably skipped this whole section.
I'll bitwise-union a few numbers (74, 35, and 76), as expressed in
binary, and show that the result is equal to the binary number which
has a 1 wherever at least one of the original numbers had a 1. Here I'm
using the interactive prompt in Python. I call "int()" with "2" to
tell it that the string is in base-2 format.
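Worked out as a small script (my version of that demonstration), it boils down to:

```python
# Union three numbers written out in base 2.
a = int("1001010", 2)   # 74
b = int("0100011", 2)   # 35
c = int("1001100", 2)   # 76

union = a | b | c
print(bin(union))   # 0b1101111 - a 1 wherever any input had a 1
print(union)        # 111
```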
Once that's made sense, here's the Python code I used to generate the
bitwise-union of all of the fingerprints in the NCI subset:
#filename = "drugs.fpt"
filename = "nci_subset.fpt"
# Accumulate the union of all the fingerprint bits
union_bits = 0
# Extract only the hex value for each compound fingerprint
# This assumes the fingerprint file is small enough to fit into memory!
# (The parser was easier to write this way.)
for block in open(filename).read().split("\n>"):
    header, remainder = block.split("\n", 1)
    if remainder.startswith("Possible superstructure"):
        _, remainder = remainder.split("\n", 1)
    # Remove spaces and newlines to get only a hex number
    fp_str = remainder.replace(" ", "").replace("\n", "")
    # Python handles 1024 bit long integers without a problem
    fp = int(fp_str, 16)
    # Merge all the one bits together
    union_bits |= fp
# This should start "0x1" if the first 3 bits are 0
print hex(union_bits)
print len(hex(union_bits)[2:-1]), "hex digits"
The output I got was, with some folding of the very long line to make
it fit into this text:
Yes indeed, the first 3 bits are never 1, though there is one bit
(where the 'd' is) which is also never 1. Since bits 1021, 1022 and
1023 of the fingerprint should always be 0, this is pretty conclusive
evidence that the hex fingerprints in Babel's "fpt" file format are
written in big-endian order. The left-most bit of the written
fingerprint is number 1023 and the right-most bit is number 0.
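With that established, pulling an individual bit out of one of these hex fingerprints is a one-liner in Python. This is a sketch; the hex string and bit numbers here are made up:

```python
def get_bit(hex_fp, n):
    """Bit n of a big-endian hex fingerprint; bit 0 is the right-most bit."""
    return (int(hex_fp.replace(" ", ""), 16) >> n) & 1

# Hypothetical 16-bit fingerprint: 0x0201 == 0b0000001000000001,
# so bits 0 and 9 are set.
fp = "02 01"
print(get_bit(fp, 0))    # 1
print(get_bit(fp, 9))    # 1
print(get_bit(fp, 15))   # 0
```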
Checking the substructure fingerprint order
Anyone who knows me knows I like sweating the details. I've been
checking the FP2 fingerprints, but perhaps the FP3 and FP4
fingerprints use a different ordering. Those are feature fingerprints
based on SMARTS patterns. The FP3 definition in "patterns.txt"
starts:
Cool, it's consistent with what I expected! Always a nice thing.
Is knowing the bit position useful?
In most cases this level of detail isn't important. If you are using
hash fingerprints then the bit doesn't really tell you much. It's
more important if you are using a feature fingerprint because each bit
can be mapped back to a chemically understandable property. Suppose
you do some sort of clustering or rule based learning on the bits and
find that bit 592 is the most divisive bit. A chemist might want to
know what that bit "means", so being able to get from the bit back to
the rule is useful.
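That reverse mapping is simple to sketch, given a table of bit definitions. The pattern names and SMARTS below are hypothetical stand-ins, not the actual FP3/FP4 definitions:

```python
# Hypothetical bit -> (name, SMARTS) table, in the spirit of the
# FP3/FP4 pattern files
patterns = {
    0: ("alcohol", "[OX2H][CX4]"),
    1: ("carboxylic acid", "C(=O)[OH]"),
}

def explain_bits(fp, patterns):
    """Return the (name, SMARTS) definition for every set bit we know."""
    return [patterns[n] for n in sorted(patterns) if (fp >> n) & 1]

print(explain_bits(0b10, patterns))   # [('carboxylic acid', 'C(=O)[OH]')]
```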
It's also important if you want interoperability between different
tools. If one tool generates bits in a different order than another
then they aren't going to work well together.
OpenBabel does support a "-xs" option to explain which bits are set,
and a "-xu" to explain which bits are not set. Only one of these is
valid, and if used then "-xf" does not work. Grr. On top of that,
the [+] and [-] definitions don't have a human readable comment, so I
can't even show you what the output looks like for this example. I've
filed a few bug reports against the fingerprint generation code in
Babel. (Since I wrote this these bugs have been fixed in version
control.)
The point of the previous expositions was to make sure you and I know
where fingerprint data comes from and what it looks like at the bit
level. The next step is to support a Tanimoto search of a set of
fingerprints. Stay tuned to this bat channel!