The Artima Developer Community
Sponsored Link

Python Buzz Forum
14 Nov 2004

0 replies on 1 page.

Welcome Guest
  Sign In

Go back to the topic listing  Back to Topic List Click to reply to this topic  Reply to this Topic Click to search messages in this forum  Search Forum Click for a threaded view of the topic  Threaded View   
Previous Topic   Next Topic
Flat View: This topic has 0 replies on 1 page
Titus Brown

Posts: 23
Nickname: titus
Registered: Nov, 2004

Titus Brown is a graduate student in biology at Caltech.
14 Nov 2004 Posted: Nov 29, 2004 8:58 AM
Reply to this message Reply

This post originated from an RSS feed registered with Python Buzz by Titus Brown.
Original Post: 14 Nov 2004
Feed Title: Advogato diary for titus
Feed URL: http://advogato.org/person/titus/rss.xml
Feed Description: Advogato diary for titus
Latest Python Buzz Posts
Latest Python Buzz Posts by Titus Brown
Latest Posts From Advogato diary for titus

Advertisement
QOTDE: "It is difficult to make predictions, especially about the future"

The write way to right re-usable bioinformatics tools.

It's frustrating how many fantastic bioinformatics analysis tools exist in a difficult-to-use form. Most of the algorithmically challenging tools I use exist only in command-line form; in fact, I can't think of a single sequence analysis program that has an external API. (I understand the situation may be slightly different in the area of clustering software, but that's not my biz at the moment.) A good external interface for NCBI BLAST or CLUSTALW would have saved me many hours.

It's not only the complex programs that suffer from this lack. One of my favorite whipping dogs is EMBOSS, a collection of many rather small command-line programs that do useful bits of analysis. They have tons of stuff, covering most anything you need to do in sequence analysis, but it's all locked away behind formats and stdin/stdout, and much of it is simply easier to re-write if you don't know how to use the program in the first place. In fact, I bet that over 90% of the programs in EMBOSS could be re-written from scratch in little more than a weekend using the scripting language of your choice (abbr "Python").

This is not an entirely idle contention; I rewrote part of fuzztran a few months ago. It took me 30 minutes -- not because I'm a fantastic programmer, but because I had a pattern-searching library that solved a more general problem. Here's what happened:

fuzztran uses a pattern language to search a database of nucleotide sequences after translation. It's useful in situations where you have a leetle bit of protein sequence -- say, from some Edman sequencing -- and want to search a genome or mRNA library for a match. This was exactly the situation I was in, but I needed to search a rather large library containing over 5 million sequences from a whole-genome shotgun sequencing effort on the sea urchin. Moreover, I needed to do an intersection of the results: I wanted to search for two substrings in proximity to each other.

I trundled on over to EMBOSS, read the fuzztran documentation, and tried running it. I immediately ran into several difficulties: it wasn't particularly fast; I didn't know if it was actually working, or if I had entered things in the wrong format; it didn't permit "percent mismatches", as in "find me sequences that match at the 90% level"; it was annoying to script; and the output format wasn't easily parseable.

Ergh.

I spent about 20 minutes trying to find an easy way to use the thing and finally decided that my time was better spent writing a specific tool for my needs. I ended up using my motility toolkit, which supports fuzzy pattern searching with position-weight matrices. I wrote a quick function to reverse-translate amino acids into codons, and thence into a position-weight matrix; once I had this "translate_protein_to_PWM" function written, the final code was very short:

for prot in protein_list:
    matrix = translate_protein_to_PWM(prot)
    length = len(matrix)
    pwm = motility.PWM(matrix)

# allow % mismatches min_score = length - int(float(length) * MISMATCH_PERCENT + 0.5)

print 'searching:', prot for sequence in sequences: if pwm.find(sequence, min_score): # save.

The code, together with testing and debugging, took a total of 30 minutes to write, and worked great -- we found the right protein & went on to verify it experimentally. (The tool is now in my slippy collection under "search-database-for-prot.py".)

Even better, this code was readily extensible to do other things, like mixed protein searches (where you've gotten mixed sequence, e.g. "RYAAGG" and "YGGGAR" were sequenced simultaneously and can't be deconvolved, so you need to search for [RY][YG][AG][AG][GA][RG]") and general domain searches. So that was nice.

OK. Ungapped fuzzy protein sequence searching is, in many senses, a toy problem. There are tons of ways to do it, I'd bet, and none of them would take very long to implement from scratch. The situation is more frustrating when you have to deal with the warts on something like water, which does a Smith-Waterman alignment. This is a moderately tricky piece of code, and reimplementing it isn't a good option for a short-term project. What would be great is if someone broke out the code that did the tricky bits -- the alignment itself -- from the code that worried about parsing input data and constructing output formats. To their credit, the EMBOSS people seem to have done this, but it's in a library that as far as I can tell isn't documented. So it's probably easiest for Joe Blow Bioinformatician to simply use the command-line program, with all of the clumsiness inherent in that approach.

I'd bitch less about the whole problem if it weren't that the EMBOSS folk, and the NCBI folk (who make BLAST), are paid for software development. As mjg59 points out, most analysis programs are written on research grants, where the short-term view outstrips the long-term view. Not so for EMBOSS, who apparently has a whole team of people writing this stuff. I just don't get it; Perl and Python are perfectly good scripting languages, and they're cross-platform; surely it would be easier to just provide a good embedding of the algorithmically challenging functions and then just write the individual programs as scripts??

O well. Some day I hope to rewrite BLAST and retool CLUSTALW to support a nice library API. 'til then, I guess I'll just gripe about the general problem here ;0).

--titus

Read: 14 Nov 2004

Topic: WayTech ActionMail Previous Topic   Next Topic Topic: Wax 0.2.39 released

Sponsored Links



Google
  Web Artima.com   

Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use