Faster block Tanimoto calculations
After much analysis and digging in part
4 I managed to get my C extension to Python to compute
fingerprints pretty quickly. I estimated that a fingerprint search of
PubChem (19,218,991 1024 bit fingerprints) would take about 13.5
seconds, but that's still almost a factor of two worse than my
estimate of 8 seconds. What's wrong?
To start with, the data bandwidth from the disk is not the limiting
factor: processing each byte of my test set using file reads takes only
0.26 seconds. The real cost is in the byte-at-a-time inner loop:
  for (i=0; i<num_bytes; i++) {
    b += _popcount_counts[targets[i]];
    c += _popcount_counts[query[i]&targets[i]];
  }
The most obvious difference between that loop and my high-speed
prototype is that the prototype works with 32 bits at a time. That
approach won't work if the fingerprint has, say, 40 bits in it. On the
plus side, it doesn't need to check for the end of the loop after every
8 bits, which reduces the branch tests by 75%. (Branches in modern
computers cause performance problems. Some details are at the Wikipedia
page on Branch predictor.)
Out of curiosity I generated the assembly code for the corresponding
popcount functions (using "gcc -S"). Here's the relevant source:
int byte_lookup(int num_bytes, unsigned char *query) {
  /* one table lookup per byte */
  int a=0, i;
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }
  return a;
}
int int_lookup(int num_bytes, int *query) {
  /* four table lookups per 32-bit word */
  int a=0, q, i;
  for (i=0; i<num_bytes/4; i++) {
    q = query[i];
    a += (_popcount_counts[q & 0xff] +
          _popcount_counts[q >> 8 & 0xff] +
          _popcount_counts[q >> 16 & 0xff] +
          _popcount_counts[q >> 24 & 0xff]);
  }
  return a;
}
The first loop body has 7 instructions and one branch test, but it
executes 4 times more often than the second, which has 18 instructions
and 1 branch test. 4*7=28, which is more than 18. I don't know how long
the CPU takes to run those instructions, so I'll just assume that all
instructions take the same time, which suggests the integer version
should be about 33% faster. That isn't quite right because some
instructions are faster than others, and pipelining and other CPU
tricks make things even more complicated.
My Python extension using bytes for the lookups is about 40% slower
than my high-speed C test case using ints. These are commensurate
numbers, or at least suggestive. It looks like I should rework my
code to use integers. But there's something I need to worry about
first.
__builtin_popcount and data alignment
Earlier I talked about using the GNU C/C++ "__builtin_popcount"
extension to compute the popcount of an integer but I've been using my
own implementation based on the _popcount_counts lookup table using
bytes. I did this for a few reasons. You might not be using the GNU
compiler, though that's unlikely. You might want to support
fingerprints whose length is a multiple of 8 bits but not of 32, though
all my tests use 1024 bit fingerprints. The main reason, though, was
that I was worried about memory alignment.
Some CPUs require non-character arrays to be memory aligned. Even for
CPUs which don't have that requirement (like Intel machines),
non-aligned access is slower than aligned access. I'm using Python to
pass a character pointer to a C function, but I don't know if that
pointer is correctly aligned to be used directly as an integer
pointer. I decided it was best to ignore the problem until I got the
code working.
The Python string object is implemented in C, and my C extension gets
a pointer to the underlying string data. Is it already aligned such
that I can cast it directly to an "int *"? I can figure that out in a
couple of ways: I can look at the source code and work out what's going
on, or I can run the code and ask it. I'll do both, starting with the
harder one.
The layout for the string class is in "Include/stringobject.h" and is:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;
The "PyObject_VAR_HEAD" is used ... well, if you're really interested
then read the documentation in "Include/object.h". The summary is
that it's a macro which in non-debug builds will expand to
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
    Py_ssize_t ob_size; /* Number of items in variable part */
Count up the space needed for the fields before it and you'll see that
"ob_sval[1]" starts at an offset which is a multiple of 4 bytes: on a
32-bit build that's ob_refcnt (4 bytes), ob_type (4), ob_size (4),
ob_shash (4) and ob_sstate (4), which puts ob_sval at offset 20. It is
therefore aligned such that it can be cast to an integer pointer
without problems.
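If you have the Python headers installed, a small stand-alone check of
that arithmetic using offsetof looks something like this (a sketch I
wrote for illustration, not part of libtanimoto):
  /* check_offset.c - print where ob_sval starts inside PyStringObject.
     Compile against the Python 2 headers, e.g.
       gcc check_offset.c -I/usr/include/python2.5 -o check_offset */
  #include "Python.h"
  #include <stdio.h>
  #include <stddef.h>

  int main(void) {
      printf("ob_sval starts at offset %lu\n",
             (unsigned long) offsetof(PyStringObject, ob_sval));
      return 0;
  }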
The easier way is to add a printf statement to the block_similarity
function.
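Something along these lines does the job (a sketch, not necessarily the
original statement; 'targets' is the incoming fingerprint pointer):
  printf("At %p %d %d\n", (void *) targets,
         (int)((long) targets % 4),
         (int)((long) targets % 8));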
This prints the pointer location, the location modulo 4 (for 32 bit
'int' alignment), and the location modulo 8 (for 64 bit 'long long'
alignment). If it's integer aligned then the modulo-4 result should be
0. When I ran this I got:
At 0x13f1884 0 4
This means the array is integer aligned (because of the '0') but not
'long long' aligned (because of the '4'). That agrees with what I
figured out from looking at the typedef layout. I can therefore cast
the "char *" pointer to an "int *" and use __builtin_popcount on the
values without worrying about alignment problems. On the other hand, I
can't use __builtin_popcountll (the difference is the 'll' at the end;
that one is for "long long"s).
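For example, a query popcount along these lines would be safe on this
data (a sketch, assuming GCC; the function name is made up for
illustration):
  /* Popcount an int-aligned fingerprint with __builtin_popcount
     instead of the lookup table. */
  int gcc_popcount(int num_bytes, unsigned char *fp) {
    unsigned int *fp_as_int = (unsigned int *) fp;  /* OK: fp is int aligned */
    int a = 0, i;
    for (i = 0; i < num_bytes / 4; i++) {
      a += __builtin_popcount(fp_as_int[i]);
    }
    return a;
  }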
My benchmark
I'm going to focus on making the Tanimoto calculations faster. As I
showed in earlier essays, my program's total run-time was only a rough
estimate for the calculation time. It included Python's start-up cost
(importing numpy took 0.25 seconds!) and the implementation choice of
reading a file vs. using a memory mapped file.
I created a new benchmark program called "time_block_similarity.py"
which does exactly what it says. I'll continue to use the
"nci.binary_fp" file as my reference data file, and to avoid disk
issues I'll read the entire file into memory before doing processing.
It's only 150 MB and I have more than that amount of free memory so
there shouldn't be any swapping.
# time_block_similarity.py
import time
import tanimoto
# This is 155,667,200 bytes; 1,216,150 1024-bit fingerprints
targets = open("nci.binary_fp", "rb").read()
# Use the first 1024 bits as the query
query = targets[:1024//8]
t1 = time.time()
results = tanimoto.block_similarity(query, targets)
t2 = time.time()
N = len(results)
print "%s calculations in %.3f seconds; %.0f per second" % (N, t2-t1, N/(t2-t1))
print "Estimated PubChem time: %.1f seconds" % (19218991 * (t2-t1)/N,)
print "First few are:"
for x in results[:5]:
    print x
Using my existing byte lookup implementation my fastest-of-three time
was:
1216150 calculations in 0.750 seconds; 1620963 per second
Estimated PubChem time: 11.9 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
That's the time I want to beat.
Faster block_similarity using integers
Changing libtanimoto.c:block_similarity to use integers instead of
characters is not hard. The following assumes that integers are 32
bits, the 'query' and 'targets' are integer aligned, and that
'num_bytes' is a multiple of 4.
/* This is added to libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xff] +
          _popcount_counts[fp_word >> 8 & 0xff] +
          _popcount_counts[fp_word >> 16 & 0xff] +
          _popcount_counts[fp_word >> 24 & 0xff]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    for (i=0; i<num_ints_in_fp; i++) {
      fp_word = targets_as_int[i];
      b += (_popcount_counts[fp_word & 0xff] +
            _popcount_counts[fp_word >> 8 & 0xff] +
            _popcount_counts[fp_word >> 16 & 0xff] +
            _popcount_counts[fp_word >> 24 & 0xff]);
      fp_word &= query_as_int[i];
      c += (_popcount_counts[fp_word & 0xff] +
            _popcount_counts[fp_word >> 8 & 0xff] +
            _popcount_counts[fp_word >> 16 & 0xff] +
            _popcount_counts[fp_word >> 24 & 0xff]);
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
and my best-of-three output from this is:
1216150 calculations in 0.411 seconds; 2957097 per second
Estimated PubChem time: 6.5 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
That's almost half the time! (A 45% reduction, if you want to be
picky.) I'm also at about the speed I predicted from my prototype. But
I can do better.
Loop unrolling
One reason the integer-based code became faster is that it now does one
memory fetch instead of four. Another is that it does about 1/4th the
number of branch tests. Branches cause CPUs to cry. Since I now assume
the fingerprint length is a multiple of 4 bytes, I can do four table
lookups before checking whether I've reached the end of the loop.
This technique is called loop unrolling.
(Despite what Wikipedia claims, most people use the term 'loop
unrolling' and do not use 'loop unwinding'.) As a simple example,
here's a snippet of code which would compute the bitcount for a
fingerprint of length 256 bits, which is 8 32-bit integers.
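A sketch of what that could look like, reusing the 8-bit
_popcount_counts table (the function name is made up for illustration):
  /* Popcount of a 256-bit fingerprint (eight 32-bit words) with the
     loop fully unrolled: no loop counter, no end-of-loop test. */
  static int popcount_256(unsigned int *fp) {
    int a = 0;
    a += _popcount_counts[fp[0] & 0xff] + _popcount_counts[fp[0] >> 8 & 0xff] +
         _popcount_counts[fp[0] >> 16 & 0xff] + _popcount_counts[fp[0] >> 24 & 0xff];
    a += _popcount_counts[fp[1] & 0xff] + _popcount_counts[fp[1] >> 8 & 0xff] +
         _popcount_counts[fp[1] >> 16 & 0xff] + _popcount_counts[fp[1] >> 24 & 0xff];
    a += _popcount_counts[fp[2] & 0xff] + _popcount_counts[fp[2] >> 8 & 0xff] +
         _popcount_counts[fp[2] >> 16 & 0xff] + _popcount_counts[fp[2] >> 24 & 0xff];
    a += _popcount_counts[fp[3] & 0xff] + _popcount_counts[fp[3] >> 8 & 0xff] +
         _popcount_counts[fp[3] >> 16 & 0xff] + _popcount_counts[fp[3] >> 24 & 0xff];
    a += _popcount_counts[fp[4] & 0xff] + _popcount_counts[fp[4] >> 8 & 0xff] +
         _popcount_counts[fp[4] >> 16 & 0xff] + _popcount_counts[fp[4] >> 24 & 0xff];
    a += _popcount_counts[fp[5] & 0xff] + _popcount_counts[fp[5] >> 8 & 0xff] +
         _popcount_counts[fp[5] >> 16 & 0xff] + _popcount_counts[fp[5] >> 24 & 0xff];
    a += _popcount_counts[fp[6] & 0xff] + _popcount_counts[fp[6] >> 8 & 0xff] +
         _popcount_counts[fp[6] >> 16 & 0xff] + _popcount_counts[fp[6] >> 24 & 0xff];
    a += _popcount_counts[fp[7] & 0xff] + _popcount_counts[fp[7] >> 8 & 0xff] +
         _popcount_counts[fp[7] >> 16 & 0xff] + _popcount_counts[fp[7] >> 24 & 0xff];
    return a;
  }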
This new version contains no tests and no branches. CPUs like that,
and so do compilers.
In my benchmark I know the test fingerprints contain 1024 bits. To see
if this idea is useful, I'll write a version of block_similarity which
completely unrolls the loop that computes 'b' and 'c'. The downside of
unrolling is that there's a lot more code. I'll work around that a bit
by using a macro, but the result is still ugly.
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xff] +
          _popcount_counts[fp_word >> 8 & 0xff] +
          _popcount_counts[fp_word >> 16 & 0xff] +
          _popcount_counts[fp_word >> 24 & 0xff]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    i=0;
#define _B_AND_C_POPCOUNT \
    fp_word = targets_as_int[i]; \
    b += (_popcount_counts[fp_word & 0xff] + \
          _popcount_counts[fp_word >> 8 & 0xff] + \
          _popcount_counts[fp_word >> 16 & 0xff] + \
          _popcount_counts[fp_word >> 24 & 0xff]); \
    fp_word &= query_as_int[i++]; \
    c += (_popcount_counts[fp_word & 0xff] + \
          _popcount_counts[fp_word >> 8 & 0xff] + \
          _popcount_counts[fp_word >> 16 & 0xff] + \
          _popcount_counts[fp_word >> 24 & 0xff]);
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
Does it make any difference?
1216150 calculations in 0.306 seconds; 3968043 per second
Estimated PubChem time: 4.8 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
Nice! This is about 25% faster than the previous best time.
16 bit lookup table
There are still more tricks up my sleeve. I wondered how much time was
spent computing the popcount of a fingerprint, so I removed the
calculation of 'b' from my unrolled macro definition. This means 'b'
will always be 0, and the scores will be completely wrong. The
resulting report was:
1216150 calculations in 0.185 seconds; 6562436 per second
Estimated PubChem time: 2.9 seconds
First few are:
inf
2.41463414634
1.91666666667
2.04347826087
2.11111111111
which means about 40% of the time (0.306 vs. 0.185 seconds) is spent
computing the popcount of one of the two fingerprints. That makes
sense, because the whole point of the loop is to compute two popcounts
per target, and I'm trying to keep everything else to a minimum.
Can I make the popcounts faster? I've been using an 8 bit lookup table.
I could cut the number of table lookups in half by using a 16 bit
lookup table, which takes only about 64K of memory and fits into cache
these days. Would that be faster?
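In terms of work per 32-bit word, that means going from four table
lookups to two; roughly like this (the helper names are made up for
illustration):
  /* With the 8-bit table: four lookups per 32-bit word. */
  static int popcount32_8bit(unsigned int w) {
    return _popcount_counts[w & 0xff] + _popcount_counts[w >> 8 & 0xff] +
           _popcount_counts[w >> 16 & 0xff] + _popcount_counts[w >> 24 & 0xff];
  }
  /* With the 16-bit table: two lookups per 32-bit word. */
  static int popcount32_16bit(unsigned int w) {
    return _popcount_counts[w & 0xffff] + _popcount_counts[w >> 16];
  }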
The 8 bit lookup table was small enough that I could write it out in 10
lines of code. The 16 bit table I made for my own proof-of-concept
testing took 2050 lines, and I don't want to list that here. Instead,
I'll define space for the entire array plus an 'init' function in
libtanimoto.c. After the Python module loads the shared library via
ctypes, it will call the initialization function to fill in the table.
It's been a while since I've shown the full "libtanimoto.c" and
"tanimoto.py" files, and I've made lots of small changes along the way,
so it's best to resynchronize. Here are complete copies of each file;
the important new parts are the 16-bit lookup table, its initialization
function, and the 16-bit lookups in block_similarity:
/* libtanimoto.c */
/* Define space for the 16-bit lookup table */
/* Bootstrap using values for the 8-bit table */
static char _popcount_counts[65536] = {
  0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
  3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* Fill in the rest of the table. This is called from 'tanimoto.py' */
void init_tanimoto(void) {
  int i;
  /* If I knew you were using GCC then I could call __builtin_popcount here */
  for (i=256; i<65536; i++) {
    _popcount_counts[i] = (_popcount_counts[i & 0xff] +
                           _popcount_counts[i >> 8]);
  }
}
/* Compute the Tanimoto similarity between two fingerprints */
double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int a=0, b=0, c=0, i;
  for (i=0; i<num_bytes; i++) {
    a += _popcount_counts[query[i]];
    b += _popcount_counts[target[i]];
    c += _popcount_counts[query[i]&target[i]];
  }
  return ((double)c)/(a+b-c);
}
/* Compute the similarity between a query fingerprint and a block
   of target fingerprints. Result values go into 'results'. */
/* This assumes num_bytes == 128 (1024 bits) and that query and
   targets are int aligned */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;
  /* precompute popcount(query) */
  for (i=0; i<num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word & 0xffff] +
          _popcount_counts[fp_word >> 16]);
  }
  /* For each fingerprint in the block */
  while (num_targets-- > 0) {
    /* compute popcount(target) and popcount(target&query) */
    b=0; c=0;
    i=0;
#define _B_AND_C_POPCOUNT \
    fp_word = targets_as_int[i]; \
    b += (_popcount_counts[fp_word & 0xffff] + \
          _popcount_counts[fp_word >> 16]); \
    fp_word &= query_as_int[i++]; \
    c += (_popcount_counts[fp_word & 0xffff] + \
          _popcount_counts[fp_word >> 16]);
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
and the high-level Python interface to it:
# tanimoto.py - wrapper to the libtanimoto shared library
import ctypes, ctypes.util
import numpy
# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
# Call its initialization function
_init_tanimoto = libtanimoto.init_tanimoto
_init_tanimoto.argtypes = []
_init_tanimoto.restype = None
_init_tanimoto()
# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double
# Define the high-level public interface as a Python function
def similarity(query, target):
    n = len(query)
    m = len(target)
    if n != m:
        raise AssertionError("fingerprints must be the same length")
    return _similarity(n, query, target)
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None
# The high-level public interface
def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result
When I ran it I found:
1216150 calculations in 0.231 seconds; 5269985 per second
Estimated PubChem time: 3.6 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
The code is about 25% faster than the previous fastest version. The
original byte lookup implementation at the start of this essay took
0.750 seconds, so the code is now about 3 times faster. It isn't good
enough for general use yet, though, because it's hard-coded to 1024
bit, integer aligned fingerprints.
That's okay. I can add a test for that: if the input meets all the
right criteria, call the specialized function, otherwise fall back to
the more general purpose but slower one. It's not something I'm going
to worry about now, but feel free to do it yourself. It might start out
something like:
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  /* ensure the data is all integer aligned */
  if ((long)query % 4 == 0 && (long)targets % 4 == 0) {
    switch (num_bytes) {
      case 128: .. specialized code for 1024 bits ..
      case 64: .. specialized code for 512 bits ..
      case 40: .. specialized code for the MDL MACCS 320 keys ..
      case 32: .. specialized code for 256 bits ..
      .. more special cases ..
      default: break;
    }
  }
  .. general purpose code goes here ..