Block Tanimoto calculations
In part
3 I wrote a C function to compute the Tanimoto similarity score
between two fingerprints. The Tanimoto is most often used when
searching a large set of target compounds. Some questions are "what's
the Tanimoto similarity for every compound in the data set", "what are
the 10 target compounds that are most similar to the query, sorted by
similarity, and what are their similarity values?" and "how many
compounds are at least some specified similarity threshold to the
query compound?"
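Once any of these searches produces (compound id, score) pairs, the
second question reduces to a partial sort. Here's a hedged sketch
using only the standard library; the scored_targets data is made up
for illustration:

# Hedged sketch (not from the original code): pick the 10 most similar
# targets from (compound id, score) pairs produced by some scoring loop.
import heapq

# made-up example data; in practice these pairs come from the search loop
scored_targets = [("CID-1", 0.91), ("CID-2", 0.42), ("CID-3", 0.77)]

# nlargest keeps the 10 highest-scoring pairs without sorting everything
for compound_id, score in heapq.nlargest(10, scored_targets,
                                         key=lambda pair: pair[1]):
    print compound_id, score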
With the tanimoto.similarity function extension I defined earlier, it's
simple to implement these algorithms. Here's one which prints the
Tanimoto score for every compound in a given ".binary_fp" file using
the first compound as the query.
# print_similarity.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits
f.seek(0)

while 1:
    target_fp = f.read(128)  # 1024 bits
    if not target_fp:
        # reached the end of file
        break
    print tanimoto.similarity(query_fp, target_fp)
How fast is this code? I took the PubChem SD file for compounds
"09425001 - 09450000" and converted it into a Babel fingerprint file,
and thence into a ".binary_fp" file.
The timing was decent. It means that PubChem (19,218,991 compounds)
would take about 2.5 minutes, which is several times faster than the
7 minutes I estimated in pure Python, but still quite a bit slower
than the 20 seconds I estimated was the limit using pure C.
Push blocks of targets to the C code
Where is the extra time being spent? Probably in all the work I'm
doing in Python in order to do a very little bit of work in C. Every
Tanimoto calculation requires a Python function call, plus another
call to read the next 1024 bits, plus a test to see if it's the end
of file. Python function calls take a lot of time compared to C
function calls.
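To put a rough number on that claim, here's a hedged timeit sketch
that measures the raw cost of calling a trivial Python function:

# Hedged sketch: measure the overhead of calling an empty Python function.
import timeit
t = timeit.Timer("noop()", setup="def noop(): pass")
print "one million no-op calls take", t.timeit(number=1000000), "seconds"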
I could do some profiling to figure out if this guess is correct but
it's about as easy to implement another solution and compare the
timings. Rather than computing a single Tanimoto at a time I'm going
to pass many fingerprints to the C function, and get a list of
Tanimoto values back.
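If I did want to check the guess before moving on, the standard
library profiler would be one way to do it. A minimal sketch, assuming
the read-and-score loop from "print_similarity.py" is wrapped in a
hypothetical search() function:

# Hedged sketch only: profile a (hypothetical) search() wrapper around
# the read-and-score loop, with output sorted by cumulative time.
import cProfile

def search():
    pass    # stand-in for the loop from print_similarity.py

cProfile.run("search()", sort="cumulative")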
One of the points I've been bringing up in these writings is that there
are many solutions to a problem. In this case I'm going to pass a set
of fingerprints to a C function, but I get to decide how I want to
pass the data in. I could use a Python list of strings, or a Numpy
array of unsigned integers, or many more possibilities. The solutions
I'm exploring are based on experience but it doesn't mean I'm right,
nor does it mean I can get to a good solution without some
exploration.
tanimoto.block_similarity - my desired API
For my solution now I'm going to pass the set of target fingerprints as
a single string, with the first fingerprint as the first 128 bytes,
the second as the next 128 bytes, and so on. I'll call this single
string which contains many fingerprints a "block" so my new function
will be called "block_similarity." I want the result to be something
like a Python list, where I can iterate over the results for each
fingerprint, in order. In code, I want the main loop of the
calculation to act like the following interface:
while 1:
    target_fps = f.read(128*1024)  # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score
I want the results returned as a list but C functions can only return
a simple data type like an integer or a pointer. Again, there are
several possible solutions. I'll have my Python wrapper module
allocate space for the results and pass a pointer to that space to the
C function, which assigns the correct values.
libtanimoto.c:block_similarity
With most of the hard work of allocating results handled by some
yet-to-be-written Python code, the C code isn't that hard:
/* This is added to libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;

  /* precompute popcount(query) */
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }

  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    for (i=0; i<num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&targets[i]];
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets += num_bytes;
  }
}
Because I expect each call to handle many targets, I did a bit of
optimization. The Tanimoto function is "c/(a+b-c)" where "a" is the
number of on-bits in the query fingerprint. That's a constant across
the whole block so I compute it once.
tanimoto.py:block_similarity
The hard work of allocating memory goes into the 'tanimoto.py' file.
I decided to use a numpy array
to store the results because they are designed to work with ctypes. I
can pass the underlying data storage area to a C function with only a
'cast'. Numpy also supports functions like 'numpy.sort' so working
with numpy arrays can be more convenient than working with a memory
block allocated directly by ctypes, which is my other likely
possibility. On the other hand it does put a third-party dependency
in the library.
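For contrast, the other possibility mentioned above, allocating the
result buffer directly with ctypes and skipping NumPy, would look
roughly like this sketch. It uses the same _block_similarity setup
shown in the next code block and returns a plain Python list instead
of a NumPy array:

# Hedged sketch of the ctypes-only alternative; not the code I'm using.
def block_similarity_ctypes_only(query, targets):
    n = len(query)
    num_targets = len(targets) // n
    # a zero-initialized C array of doubles; ctypes passes it as a pointer
    results = (ctypes.c_double * num_targets)()
    _block_similarity(n, query, num_targets, targets, results)
    return list(results)    # converts the C doubles to Python floats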
Here's the code:
# This is added to 'tanimoto.py'
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None

import numpy

def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result
Looks easy, right? You'll be surprised at how long it took to get
that working. I figured that because the numpy array holds
'numpy.double' values, ctypes would see it as a ctypes.c_double array
without needing the cast. Instead, without the cast, ctypes complains
with an error about the argument type.
Strange. But what I did seems to work, so hopefully others who are
confused will search on the error message and find this essay useful.
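For what it's worth, there seem to be two tidier spellings that avoid
the bare ctypes.cast, though I haven't switched my code over to
either. Treat this as a hedged sketch built on the block above (the
names ctypes, numpy, result, _block_similarity and the call arguments
are the ones defined there):

# Option 1: ndarray.ctypes.data_as() performs the same cast with less typing
result_ctypes = result.ctypes.data_as(ctypes.POINTER(ctypes.c_double))

# Option 2: declare the last argument with numpy.ctypeslib.ndpointer so
# the NumPy array can be passed directly, with no cast at the call site
import numpy.ctypeslib
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              numpy.ctypeslib.ndpointer(dtype=numpy.double,
                                                        flags="C_CONTIGUOUS")]
_block_similarity(n, query, num_targets, targets, result)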
After I compiled the modified libtanimoto shared library (using
'scons') and updated the 'tanimoto.py' wrapper library I did a quick
test:
>>> tanimoto.block_similarity("A", "AaB")
array([ 1. , 0.66666667, 0.33333333])
>>>
>>> [hex(ord(c)) for c in "AaB"]
['0x41', '0x61', '0x42']
>>>
In this case the query fingerprint is 8 bits long (the character 'A',
which is the bit pattern 01000001) and the target fingerprint block
contains three 8-bit fingerprints (01000001, 01100001 and 01000010).
The result should be three Tanimoto scores, with values 1.0, 2/3 and
1/3 respectively. Which I got. It seems to work.
Not shown here is the step where I compared the results of the
"block_similarity" function to one based on the older "similaity"
function, to make sure they gave the same results.
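A minimal sketch of that kind of cross-check, assuming a ".binary_fp"
file of 1024-bit fingerprints (this is a reconstruction, not the exact
script I used):

# Hedged sketch: compare block_similarity against the one-at-a-time
# similarity function for every fingerprint in a small test file.
import sys
import tanimoto

f = open(sys.argv[1], "rb")
query_fp = f.read(128)
f.seek(0)
targets = f.read()            # the whole file; fine for a small test set
block_scores = tanimoto.block_similarity(query_fp, targets)

for i in range(len(targets) // 128):
    target_fp = targets[i*128:(i+1)*128]
    single_score = tanimoto.similarity(query_fp, target_fp)
    # both paths do the same double arithmetic so they should agree closely
    assert abs(single_score - block_scores[i]) < 1e-10, (
        i, single_score, block_scores[i])
print "all", len(block_scores), "scores agree"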
Computing all Tanimoto scores using block_similarity
Once I was convinced I got the right answers, I wrote a
"print_similarity2.py" function which works the same as
"print_similarity.py" (it calculates and prints all of the Tanimoto
scores for the given fingerprint file) except it uses the block-based
similarity calculations.
# print_similarity2.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits
f.seek(0)

while 1:
    target_fps = f.read(128*1024)  # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score
This code takes about 4.1 seconds to run on my 'nci.binary_fp' data
set, which is quite a bit better than the 9.9 seconds of
"print_similarity.py". Searching PubChem this way should take about
65 seconds, which is about 1/3rd of the speed I expect from C code.
The effects of block size
The code uses a tunable parameter, which is the number of fingerprints
to include in a block. I timed a variety of block sizes to figure out
what is optimal for my machine:
number of fps    time
   in block      (in s)
          1      42.06
          2      23.41
          5      11.85
         10       8.31
         25       5.68
         50       4.82
        100       4.46
       1000       4.17
       5000       4.17
      10000       4.22
     100000       5.15
It looks like my chosen size of 1024 is a good one. It gives good
performance while minimizing memory use. Your machine might be
different. (Note that I used my computer to listen to music during
this, which is why the numbers are slightly slower than 4.12 seconds.)
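Something along these lines would produce the numbers in the table; a
simplified sketch with the block size as a hypothetical command-line
argument, not necessarily the harness I actually used:

# Hedged sketch of a block-size timing harness; each row of the table
# corresponds to one run with a different fps_per_block value.
import sys
import time
import tanimoto

filename = sys.argv[1]
fps_per_block = int(sys.argv[2])      # e.g. 1, 10, 1000, ...

f = open(filename, "rb")
query_fp = f.read(128)
f.seek(0)

t0 = time.time()
while 1:
    target_fps = f.read(128 * fps_per_block)
    if not target_fps:
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score
print >>sys.stderr, fps_per_block, "fps/block:", time.time() - t0, "seconds"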
My data set is "only" 155,667,200 bytes long (149 MB). It easily fits
into memory. I was curious about how much time could be saved if I
didn't do any I/O, that is, if I read the entire data file into a
string and only computed the time needed to run "block_similarity" on
that string. Result: 3.30s, or an estimated PubChem search time of 52
seconds. About 0.9 seconds is being spent in Python startup time and
reading the data file.
Why's my code so slow?
Almost all of my time is in C code, yet I'm still at 1/3rd the time I
expect from the fastest possible C code. What's going on? There are
still a few differences between my code and the "fast" code I wrote:

- The fast code repeated the same calculations many times, which meant
  it didn't need to fetch data from disk and memory to the CPU.
- My block_similarity wrapper code allocates memory for the results.
- I iterate over the results, and print them to /dev/null.
- My popcount algorithm uses a lookup table, not __builtin_popcount().
Of these, I'm guessing "print score" probably takes the longest. Each
number is converted (using __str__) into a string and sent to stdout,
only to be redirected to /dev/null. To test that guess I changed
"print_similarity2.py" so the inner loop was:
for score in tanimoto.block_similarity(query_fp, target_fps):
    #print score
    pass
I no longer print out the score. When I run the code the time drops
from 4.12 to 1.84 seconds, which is quite a difference! The estimated
time to
compute the Tanimoto for all of PubChem (but not print out the
results) is now 30 seconds, which is only 4 times slower than my
expected fastest possible performance (using this approach) of 8
seconds.
Of course, if I'm not going to print the values then there's no need
to iterate over the return values. This iteration is slower than
usual in Python because the data is stored in a NumPy array, which holds
C doubles, not Python floats. The iteration has to convert each C
double into a Python float, which also takes time.
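If I did still want the scores as Python floats, my guess is that
converting the whole array in one call would beat converting each
element inside the loop; a hedged sketch, meant to replace the inner
loop of "print_similarity2.py":

# Hedged sketch: tolist() converts all the C doubles to Python floats in
# one call, and the text is written in one chunk instead of per-score prints.
scores = tanimoto.block_similarity(query_fp, target_fps)
sys.stdout.write("\n".join(str(s) for s in scores.tolist()) + "\n")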
I removed iteration from my code so the main loop becomes:
while 1:
    target_fp = f.read(128*4096)  # 4096 fingerprints of 1024 bits each
    if not target_fp:
        # reached the end of file
        break
    tanimoto.block_similarity(query_fp, target_fp)
and the total time for my test set is 1.6 seconds,
or an estimated 26 seconds for all of PubChem. I feel better now, as
I wasn't sure why my code was so much slower than it should be. I
could go back and rewrite my text so you would never see my confusion
and wonderings about this, but where's the fun in that? In this way
you can see some of how I got to where I did.
mmap - memory-mapped files
Earlier I mentioned that the block size was a tuneable parameter. I
don't like tuneable parameters. I prefer self-tuning systems or using
a better algorithm. It turns out there is another solution - a
memory-mapped file. This is an operating system mechanism to treat a
file as a string in memory instead of using file operations like
'read()'. In theory I would mmap the file to a block of memory then
pass the block directly to the "block_similarity" function.
Sadly, it doesn't work as easily as I thought it would. Python has a
mmap module
which handles making a memory-mapped file, but the mmap object isn't
compatible with the 'ctypes' interface. It seems that perhaps Python
2.6 makes this easier, but I'm not sure.
Instead I bypassed Python's mmap module and called the mmap function
directly from libc, using ctypes. I ran into some difficulties, which
ended up being because the "off_t" C type is a 64 bit integer, and not
a 32 bit one like I expected. Getting the data types correct is one
of the hard parts of using ctypes.
Here's the code for "print_similarity3.py". It's using the same
libtanimoto:block_similarity function as "print_similarity2.py" but
because the entire file is mapped to a string, I don't have to worry
about reading a block at a time - I can pass the entire mmap'ed file
to the function. (In the PubChem case this will create a 152 MByte
NumPy array of doubles, which may be unexpected.)
# print_similarity3.py
import sys
import os
import mmap
import ctypes, ctypes.util
from ctypes import c_void_p, c_int, c_char, c_int64
import tanimoto
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.mmap.restype = c_void_p
libc.mmap.argtypes = (c_int, c_int, c_int, c_int, c_int, c_int64)
filename = "drugs.binary_fp"
if len(sys.argv) > 1:
    filename = sys.argv[1]
# Get the query fingerprint and reset
f = open(filename, "rb")
query_fp = f.read(128) # 1024 bits
f.seek(0)
size = os.path.getsize(filename)
# mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);
ptr = libc.mmap(0, size, mmap.PROT_READ, mmap.MAP_SHARED, f.fileno(), 0)
targets = ctypes.cast(ptr, ctypes.POINTER(c_char*size))[0]
# Don't iterate and print the scores; that takes a long time
# and I want to focus on the computation part.
#for score in tanimoto.block_similarity(query_fp, targets):
# print score
tanimoto.block_similarity(query_fp, targets)
Getting timing numbers for this has been a bit tricky. The first time
I ran this took a few tens of seconds as my Mac swapped pages around.
I had a lot of apps using virtual memory. The second time it went
faster. I've found in general that using mmap'ed files gives
performance profiles that are harder to predict, so I tend to shy away
from them. I also have read strange things about using them with big
files (>1GB, or at least > main memory) but I don't have the
experience to judge.
PubChem's fingerprint file would be 2.3 GB. If the mmap access scales
to that file size then I can compute all its Tanimoto values in about
20 seconds, which is impressively fast.
Python and NumPy startup costs
My reference C code could in theory calculate Tanimoto values for
PubChem in 8 seconds, so my Python program timings, even with mmap,
seem rather slow. Then again, these aren't computing the same
things. For one thing, by simply scaling the time I'm exaggerating
the overhead of starting Python. That should not be affected by the
number of compounds to compute.
I changed the last few lines of the "print_similarity3.py" file so I
could see how much time was spent in running "block_similarity".
It was 0.90 seconds out of a total time of 1.28 seconds. That means
0.4 seconds is being spent on startup costs? That's unexpected. Of
that, Python's startup time is 0.05 seconds.
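The change was something like the following sketch (a reconstruction,
not necessarily my exact lines), appended to "print_similarity3.py" in
place of the final call:

# Hedged sketch: time just the block_similarity call; sys, tanimoto,
# query_fp and targets are the names already set up in the script.
import time
t0 = time.time()
tanimoto.block_similarity(query_fp, targets)
print >>sys.stderr, "block_similarity took", time.time() - t0, "seconds"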
For whatever reason, 'import numpy' also imports a huge number of
modules, about half of which are numpy submodules.
>>> import sys
>>> len(sys.modules)
28
>>> import numpy
>>> len(sys.modules)
225
>>> len([s for s in sorted(sys.modules) if 'numpy' in s])
127
>>>
Timing just 'libtanimoto.block_similarity' gave me 0.9 seconds. Of
that, the difference between
for (i=0; i<num_bytes; i++) {
  b += _popcount_counts[targets[i]];
  c += _popcount_counts[query[i]&targets[i]];
}
and
for (i=0; i<num_bytes; i++) {
  b += targets[i];
  c += query[i]&targets[i];
}
is 0.88 seconds vs. 0.50 seconds. Okay, I'm convinced now that most
of the time is being spent on the Tanimoto calculation. I figured
out where more of the time is being spent (replacing "i<num_bytes"
with "i<128" brings the time to 0.84 seconds, and my fast code
worked on 32 bit integers instead of doing character lookups), but
I'll work on that in the next essay.