This post originated from an RSS feed registered with Python Buzz
by Titus Brown.
Original Post: 14 Nov 2004
Feed Title: Advogato diary for titus
Feed URL: http://advogato.org/person/titus/rss.xml
Feed Description: Advogato diary for titus
The write way to right re-usable bioinformatics tools.
It's frustrating how many fantastic bioinformatics analysis tools
exist in a difficult-to-use form. Most of the algorithmically
challenging tools I use exist only in command-line form; in fact,
I can't think of a single sequence analysis program that has an
external API. (I understand the situation may be slightly different
in the area of clustering software, but that's not my biz at the
moment.) A good external interface for NCBI BLAST or CLUSTALW would
have saved me many hours.
It's not only the complex programs that suffer from this lack. One of
my favorite whipping dogs is EMBOSS, a
collection of many rather small command-line programs that do useful
bits of analysis. They have tons of stuff, covering most anything you
need to do in sequence analysis, but it's all locked away behind
formats and stdin/stdout, and much of it is simply easier to re-write
if you don't know how to use the program in the first place. In fact,
I bet that over 90% of the programs in EMBOSS could be re-written from
scratch in little more than a weekend using the scripting language of
your choice (abbr "Python").
This is not an entirely idle contention; I rewrote part of fuzztran
a few months ago. It took me 30 minutes -- not because I'm a
fantastic programmer, but because I had a pattern-searching library
that solved a more general problem. Here's what happened:
fuzztran uses a pattern language to search a database of nucleotide
sequences after translation. It's useful in situations where you have
a leetle bit of protein sequence -- say, from some Edman sequencing --
and want to search a genome or mRNA library for a match. This was
exactly the situation I was in, but I needed to search a rather
large library containing over 5 million sequences from a whole-genome
shotgun sequencing effort on the sea urchin. Moreover, I needed
to do an intersection of the results: I wanted to search for two
substrings in proximity to each other.
I trundled on over to EMBOSS, read the fuzztran documentation, and
tried running it. I immediately ran into several difficulties: it
wasn't particularly fast; I didn't know if it was actually working, or
if I had entered things in the wrong format; it didn't permit "percent
mismatches", as in "find me sequences that match at the 90% level"; it
was annoying to script; and the output format wasn't easily
parseable.
Ergh.
I spent about 20 minutes trying to find an easy way to use the thing
and finally decided that my time was better spent writing a specific
tool for my needs. I ended up using my motility toolkit,
which supports fuzzy pattern searching with position-weight matrices.
I wrote a quick function to reverse-translate amino acids into codons,
and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:
for prot in protein_list:
matrix = translate_protein_to_PWM(prot)
length = len(matrix)
pwm = motility.PWM(matrix)
print 'searching:', prot
for sequence in sequences:
if pwm.find(sequence, min_score):
# save.
The code, together with testing and debugging, took a total of 30 minutes
to write, and worked great -- we found the right protein & went on to
verify it experimentally.
(The tool is now in my slippy collection
under "search-database-for-prot.py".)
Even better, this code was readily extensible to do other things, like
mixed protein searches (where you've gotten mixed sequence,
e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be
deconvolved, so you need to search for [RY][YG][AG][AG][GA][RG]")
and general domain searches. So that was nice.
OK. Ungapped fuzzy protein sequence searching is, in many senses, a
toy problem. There are tons of ways to do it, I'd bet, and none of
them would take very long to implement from scratch. The situation is
more frustrating when you have to deal with the warts on something
like water,
which does a Smith-Waterman alignment. This is a moderately tricky
piece of code, and reimplementing it isn't a good option for a
short-term project. What would be great is if someone broke out the
code that did the tricky bits -- the alignment itself -- from the code
that worried about parsing input data and constructing output formats.
To their credit, the EMBOSS people seem to have done this, but it's in
a library that as far as I can tell isn't documented. So it's
probably easiest for Joe Blow Bioinformatician to simply use the
command-line program, with all of the clumsiness inherent in that
approach.
I'd bitch less about the whole problem if it weren't that the EMBOSS
folk, and the NCBI folk (who make BLAST), are paid for software
development. As mjg59
points out, most analysis programs are written on research grants,
where the short-term view outstrips the long-term view. Not so for
EMBOSS, who apparently has a whole team of people writing this
stuff. I just don't get it; Perl and Python are perfectly good scripting
languages, and they're cross-platform; surely it would be easier
to just provide a good embedding of the algorithmically challenging
functions and then just write the individual programs as scripts??
O well. Some day I hope to rewrite BLAST and retool CLUSTALW to
support a nice library API. 'til then, I guess I'll just gripe about
the general problem here ;0).