Faster block Tanimoto calculations
After much analysis and digging in part
4 I managed to get my C extension to Python to compute
fingerprints pretty quickly. I estimated that a fingerprint search of
PubChem (19,218,991 1024 bit fingerprints) would take about 13.5
seconds, but that's still almost a factor of two worse than my
estimate of 8 seconds. What's wrong?
To start with, the data bandwidth from the disk is not the limiting
factor: processing each byte of my test set using file reads takes only
0.26 seconds. The real cost is in the byte-at-a-time inner loop:
  for (i=0; i<num_bytes; i++) {
    b += _popcount_counts[targets[i]];
    c += _popcount_counts[query[i]&targets[i]];
  }
The most obvious difference between that loop and my high-speed
prototype is that the prototype works with 32 bits at a time. That
approach won't work if the fingerprint has, say, 40 bits in it. On the
plus side, it doesn't need to check for the end of the loop after every
8 bits, which reduces the branch tests by 75%. (Branches in modern
computers cause performance problems. Some details are at the Wikipedia
page on Branch predictor.)
Out of curiosity I generated the assembly code for the corresponding
popcount functions (using "gcc -S"). Here's the relevant source:
int byte_lookup(int num_bytes, unsigned char *query) {
  /* one table lookup per byte */
  int a=0, i;
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }
  return a;
}
int int_lookup(int num_bytes, int *query) {
  /* four table lookups per 32-bit word */
  int a=0, q, i;
  for (i=0; i<num_bytes/4; i++) {
    q = query[i];
    a += (_popcount_counts[q & 0xff] +
          _popcount_counts[q >> 8 & 0xff] +
          _popcount_counts[q >> 16 & 0xff] +
          _popcount_counts[q >> 24 & 0xff]);
  }
  return a;
}
The first loop body has 7 instructions and one branch test, but it
executes 4 times more often than the second, which has 18 instructions
and 1 branch test. 4*7=28, which is more than 18. I don't know how long
the CPU takes to run those instructions, so I'll just assume that all
instructions take the same time, which suggests the integer version
should be about 33% faster. That isn't quite right because some
instructions are faster than others, and pipelining and other CPU
tricks make things even more complicated.
My Python extension using bytes for the lookups is about 40% slower
than my high-speed C test case using ints. These are commensurate
numbers, or at least suggestive. It looks like I should rework my
code to use integers. But there's something I need to worry about
first.
__builtin_popcount and data alignment
Earlier I talked about using the GNU C/C++ "__builtin_popcount"
extension to compute the popcount of an integer but I've been using my
own implementation based on the _popcount_counts lookup table using
bytes. I did this for a few reasons. You might not be using the GNU
compiler, though that's unlikely. You might want to support
fingerprints whose length is a multiple of 8 bits but not of 32, though
all my tests use 1024 bit fingerprints. The main reason, though, was
that I was worried about memory alignment.
Some CPUs require non-character arrays to be memory aligned. Even for
CPUs which don't have that requirement (like Intel machines),
non-aligned access is slower than aligned access. I'm using Python to
pass a character pointer to a C function, but I don't know if that
pointer is correctly aligned to be used directly as an integer
pointer. I decided it was best to ignore the problem until I got the
code working.
The Python string object is implemented in C, and my C extension gets
a pointer to the underlying string data. Is it already aligned such
that I can cast it directly to an "int *"? I can figure that out in a
couple of ways: I can look at the source code and work out what's going
on, or I can run the code and ask it. I'll do both, starting with the
harder one.
The layout for the string class is in "Include/stringobject.h" and is:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;
The "PyObject_VAR_HEAD" is used ... well, if you're really interested
then read the documentation in "Include/object.h". The summary is
that it's a macro which in non-debug builds will expand to
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    Py_ssize_t ob_size; /* Number of items in variable part */
Count up the space needed for the fields before it and you'll see that
"ob_sval[1]" starts at an offset which is a multiple of 4 bytes: on a
32-bit build that's ob_refcnt (4 bytes), ob_type (4), ob_size (4),
ob_shash (4) and ob_sstate (4), which puts ob_sval at offset 20. It is
therefore aligned such that it can be cast to an integer pointer
without problems.
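If you have the Python headers installed, a small stand-alone check of
that arithmetic using offsetof looks something like this (a sketch I
wrote for illustration, not part of libtanimoto):
  /* check_offset.c - print where ob_sval starts inside PyStringObject.
     Compile against the Python 2 headers, e.g.
       gcc check_offset.c -I/usr/include/python2.5 -o check_offset */
  #include "Python.h"
  #include <stdio.h>
  #include <stddef.h>

  int main(void) {
      printf("ob_sval starts at offset %lu\n",
             (unsigned long) offsetof(PyStringObject, ob_sval));
      return 0;
  }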
The easier way is to add a printf statement to the block_similarity
function.
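Something along these lines does the job (a sketch, not necessarily the
original statement; 'targets' is the incoming fingerprint pointer):
  printf("At %p %d %d\n", (void *) targets,
         (int)((long) targets % 4),
         (int)((long) targets % 8));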
This prints the pointer location, the location modulo 4 (for 32 bit
'int' alignment), and the location modulo 8 (for 64 bit 'long long'
alignment). If it's integer aligned then the modulo-4 result should be
0. When I ran this I got:
At 0x13f1884 0 4
This means the array is integer aligned (because of the '0') but not
'long long' aligned (because of the '4'). That agrees with what I
figured out from looking at the typedef layout. I can therefore cast
the "char *" pointer to an "int *" and use __builtin_popcount on the
values without worrying about alignment problems. On the other hand, I
can't use __builtin_popcountll (the difference is the 'll' at the end;
that one is for "long long"s).
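For example, a query popcount along these lines would be safe on this
data (a sketch, assuming GCC; the function name is made up for
illustration):
  /* Popcount an int-aligned fingerprint with __builtin_popcount
     instead of the lookup table. */
  int gcc_popcount(int num_bytes, unsigned char *fp) {
    unsigned int *fp_as_int = (unsigned int *) fp;  /* OK: fp is int aligned */
    int a = 0, i;
    for (i = 0; i < num_bytes / 4; i++) {
      a += __builtin_popcount(fp_as_int[i]);
    }
    return a;
  }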
My benchmark
I'm going to focus on making the Tanimoto calculations faster. As I
showed in earlier essays, my program's total run-time was only a rough
estimate for the calculation time. It included Python's start-up cost
(importing numpy took 0.25 seconds!) and the implementation choice of
reading a file vs. using a memory mapped file.
I created a new benchmark program called "time_block_similarity.py"
which does exactly what it says. I'll continue to use the
"nci.binary_fp" file as my reference data file, and to avoid disk
issues I'll read the entire file into memory before doing processing.
It's only 150 MB and I have more than that amount of free memory so
there shouldn't be any swapping.
# time_block_similarity.py
import time
import tanimoto
# This is 155,667,200 bytes; 1,216,150 1024-bit fingerprints
targets = open("nci.binary_fp", "rb").read()
# Use the first 1024 bits as the query
query = targets[:1024//8]
t1 = time.time()
results = tanimoto.block_similarity(query, targets)
t2 = time.time()
N = len(results)
print "%s calculations in %.3f seconds; %.0f per second" % (N, t2-t1, N/(t2-t1))
print "Estimated PubChem time: %.1f seconds" % (19218991 * (t2-t1)/N,)
print "First few are:"
for x in results[:5]:
    print x
Using my existing byte lookup implementation my fastest-of-three time
was:
1216150 calculations in 0.750 seconds; 1620963 per second
Estimated PubChem time: 11.9 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
That's the time I want to beat.
Faster block_similarity using integers
Changing libtanimoto.c:block_similarity to use integers instead of
characters is not hard. The following assumes that integers are 32
bits, the 'query' and 'targets' are integer aligned, and that
'num_bytes' is a multiple of 4.
/* This is added to libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xff] +
          _popcount_counts[fp_word >> 8 & 0xff] +
          _popcount_counts[fp_word >> 16 & 0xff] +
          _popcount_counts[fp_word >> 24 & 0xff]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    for (i=0; i<num_ints_in_fp; i++) {
      fp_word = targets_as_int[i];
      b += (_popcount_counts[fp_word & 0xff] +
            _popcount_counts[fp_word >> 8 & 0xff] +
            _popcount_counts[fp_word >> 16 & 0xff] +
            _popcount_counts[fp_word >> 24 & 0xff]);
      fp_word &= query_as_int[i];
      c += (_popcount_counts[fp_word & 0xff] +
            _popcount_counts[fp_word >> 8 & 0xff] +
            _popcount_counts[fp_word >> 16 & 0xff] +
            _popcount_counts[fp_word >> 24 & 0xff]);
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
and my best-of-three output from this is:
1216150 calculations in 0.411 seconds; 2957097 per second
Estimated PubChem time: 6.5 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
That's almost half the time! (A 45% reduction, if you want to be
picky.) I'm also at about the speed I predicted from my prototype. But
I can do better.
Loop unrolling
One reason the integer-based code became faster is that it now does one
memory fetch instead of four. Another is that it does about 1/4th the
number of branch tests. Branches cause CPUs to cry. Since I now assume
the fingerprint length is a multiple of 4 bytes, I can do four table
lookups before checking whether I've reached the end of the loop.
This technique is called loop unrolling.
(Despite what Wikipedia claims, most people use the term 'loop
unrolling' and do not use 'loop unwinding'.) As a simple example,
here's a snippet of code which would compute the bitcount for a
fingerprint of length 256 bits, which is 8 32-bit integers.
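A sketch of what that could look like, reusing the 8-bit
_popcount_counts table (the function name is made up for illustration):
  /* Popcount of a 256-bit fingerprint (eight 32-bit words) with the
     loop fully unrolled: no loop counter, no end-of-loop test. */
  static int popcount_256(unsigned int *fp) {
    int a = 0;
    a += _popcount_counts[fp[0] & 0xff] + _popcount_counts[fp[0] >> 8 & 0xff] +
         _popcount_counts[fp[0] >> 16 & 0xff] + _popcount_counts[fp[0] >> 24 & 0xff];
    a += _popcount_counts[fp[1] & 0xff] + _popcount_counts[fp[1] >> 8 & 0xff] +
         _popcount_counts[fp[1] >> 16 & 0xff] + _popcount_counts[fp[1] >> 24 & 0xff];
    a += _popcount_counts[fp[2] & 0xff] + _popcount_counts[fp[2] >> 8 & 0xff] +
         _popcount_counts[fp[2] >> 16 & 0xff] + _popcount_counts[fp[2] >> 24 & 0xff];
    a += _popcount_counts[fp[3] & 0xff] + _popcount_counts[fp[3] >> 8 & 0xff] +
         _popcount_counts[fp[3] >> 16 & 0xff] + _popcount_counts[fp[3] >> 24 & 0xff];
    a += _popcount_counts[fp[4] & 0xff] + _popcount_counts[fp[4] >> 8 & 0xff] +
         _popcount_counts[fp[4] >> 16 & 0xff] + _popcount_counts[fp[4] >> 24 & 0xff];
    a += _popcount_counts[fp[5] & 0xff] + _popcount_counts[fp[5] >> 8 & 0xff] +
         _popcount_counts[fp[5] >> 16 & 0xff] + _popcount_counts[fp[5] >> 24 & 0xff];
    a += _popcount_counts[fp[6] & 0xff] + _popcount_counts[fp[6] >> 8 & 0xff] +
         _popcount_counts[fp[6] >> 16 & 0xff] + _popcount_counts[fp[6] >> 24 & 0xff];
    a += _popcount_counts[fp[7] & 0xff] + _popcount_counts[fp[7] >> 8 & 0xff] +
         _popcount_counts[fp[7] >> 16 & 0xff] + _popcount_counts[fp[7] >> 24 & 0xff];
    return a;
  }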
This new version contains no tests and no branches. CPUs like that,
and so do compilers.
In my benchmark I know the test fingerprints contain 1024 bits. To see
if this idea is useful, I'll write a version of block_similarity which
completely unrolls the loop that computes 'b' and 'c'. The downside of
unrolling is that there's a lot more code. I'll work around that a bit
by using a macro, but the result is still ugly.
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xff] +
          _popcount_counts[fp_word >> 8 & 0xff] +
          _popcount_counts[fp_word >> 16 & 0xff] +
          _popcount_counts[fp_word >> 24 & 0xff]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    i=0;
#define _B_AND_C_POPCOUNT \
    fp_word = targets_as_int[i]; \
    b += (_popcount_counts[fp_word & 0xff] + \
          _popcount_counts[fp_word >> 8 & 0xff] + \
          _popcount_counts[fp_word >> 16 & 0xff] + \
          _popcount_counts[fp_word >> 24 & 0xff]); \
    fp_word &= query_as_int[i++]; \
    c += (_popcount_counts[fp_word & 0xff] + \
          _popcount_counts[fp_word >> 8 & 0xff] + \
          _popcount_counts[fp_word >> 16 & 0xff] + \
          _popcount_counts[fp_word >> 24 & 0xff]);
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
Does it make any difference?
1216150 calculations in 0.306 seconds; 3968043 per second
Estimated PubChem time: 4.8 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
Nice! This is about 25% faster than the previous best time.
16 bit lookup table
There are still more tricks up my sleeve. I wondered how much time was
spent computing the popcount of a fingerprint, so I removed the
calculation of 'b' from my unrolled macro definition. This means 'b'
will always be 0, and the scores will be completely wrong. The
resulting report was:
1216150 calculations in 0.185 seconds; 6562436 per second
Estimated PubChem time: 2.9 seconds
First few are:
inf
2.41463414634
1.91666666667
2.04347826087
2.11111111111
which means about 40% of the time (0.306 vs. 0.185 seconds) is spent
computing the popcount of one of the two fingerprints. That makes
sense, because the whole point of the loop is to compute two popcounts
per target, and I'm trying to keep everything else to a minimum.
Can I make the popcounts faster? I've been using an 8 bit lookup table.
I could cut the number of table lookups in half by using a 16 bit
lookup table, which takes only about 64K of memory and fits into cache
these days. Would that be faster?
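In terms of work per 32-bit word, that means going from four table
lookups to two; roughly like this (the helper names are made up for
illustration):
  /* With the 8-bit table: four lookups per 32-bit word. */
  static int popcount32_8bit(unsigned int w) {
    return _popcount_counts[w & 0xff] + _popcount_counts[w >> 8 & 0xff] +
           _popcount_counts[w >> 16 & 0xff] + _popcount_counts[w >> 24 & 0xff];
  }
  /* With the 16-bit table: two lookups per 32-bit word. */
  static int popcount32_16bit(unsigned int w) {
    return _popcount_counts[w & 0xffff] + _popcount_counts[w >> 16];
  }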
The 8 bit lookup table was small enough that I could write it out in 10
lines of code. The 16 bit table I made for my own proof-of-concept
testing took 2050 lines, and I don't want to list that here. Instead,
I'll define space for the entire array plus an 'init' function in
libtanimoto.c. After the Python module loads the shared library via
ctypes, it will call the initialization function to fill in the table.
It's been a while since I've shown the full "libtanimoto.c" and
"tanimoto.py" files, and I've made lots of small changes along the way,
so it's best to resynchronize. Here are complete copies of each file;
the important new parts are the 16-bit lookup table, its initialization
function, and the 16-bit lookups in block_similarity:
/* libtanimoto.c */
/* Define space for the 16-bit lookup table */
/* Bootstrap using values for the 8-bit table */
static char _popcount_counts[65536] = {
  0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* Fill in the rest of the table. This is called from 'tanimoto.py' */
void init_tanimoto(void) {
  int i;
  /* If I knew you were using GCC then I could call __builtin_popcount here */
  for (i=256; i<65536; i++) {
    _popcount_counts[i] = (_popcount_counts[i & 0xff] +
                           _popcount_counts[i >> 8]);
  }
}
/* Compute the Tanimoto similarity between two fingerprints */
double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int a=0, b=0, c=0, i;
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
    b += _popcount_counts[target[i]];
    c += _popcount_counts[query[i]&target[i]];
  }
  return ((double)c)/(a+b-c);
}
/* Compute the similarity between a query fingerprint and a block
   of target fingerprints. Result values go into 'results'. */
/* This assumes num_bytes == 128 (1024 bits) and that query and
   targets are int aligned */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xffff] +
          _popcount_counts[fp_word >> 16]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    i=0;
#define _B_AND_C_POPCOUNT \
    fp_word = targets_as_int[i]; \
    b += (_popcount_counts[fp_word & 0xffff] + \
          _popcount_counts[fp_word >> 16]); \
    fp_word &= query_as_int[i++]; \
    c += (_popcount_counts[fp_word & 0xffff] + \
          _popcount_counts[fp_word >> 16]);
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
and the high-level Python interface to it:
# tanimoto.py - wrapper to the libtanimoto shared library
import ctypes, ctypes.util
import numpy
# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
# Call its initialization function
_init_tanimoto = libtanimoto.init_tanimoto
_init_tanimoto.argtypes = []
_init_tanimoto.restype = None
_init_tanimoto()
# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double
# Define the high-level public interface as a Python function
def similarity(query, target):
    n = len(query)
    m = len(target)
    if n != m:
        raise AssertionError("fingerprints must be the same length")
    return _similarity(n, query, target)
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None
# The high-level public interface
def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result
When I ran it I found:
1216150 calculations in 0.231 seconds; 5269985 per second
Estimated PubChem time: 3.6 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
The code is about 25% faster than the previous fastest version. The
original byte lookup implementation at the start of this essay took
0.750 seconds, so the code is now about 3 times faster. It isn't good
enough for general use yet, though, because it's hard-coded to 1024
bit, integer aligned fingerprints.
That's okay. I can add a test for that: if the input meets all the
right criteria, call the specialized function, otherwise fall back to
the more general purpose but slower one. It's not something I'm going
to worry about now, but feel free to do it yourself. It might start out
something like:
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  /* ensure the data is all integer aligned */
  if ((long)query % 4 == 0 && (long)targets % 4 == 0) {
    switch (num_bytes) {
      case 128: .. specialized code for 1024 bits ..
      case 64: .. specialized code for 512 bits ..
      case 40: .. specialized code for the MDL MACCS 320 keys ..
      case 32: .. specialized code for 256 bits ..
      .. more special cases ..
      default: break;
    }
  }
  .. general purpose code goes here ..