QOTDE: "The lessons of history teach us -- if the lessons of history
teach us anything -- that nobody learns the lessons that history
teaches us." (R. Heinlein)
Use Python -- or a language like it. Plus, my savage hatred of
"system()"
Hey, look -- a
fan! Matthew, dontcha know that the best way to defeat trolls is
to ignore them? Or was that giant advertising animatroids? I forget.
(<-- gratuitous Simpsons reference.)
Quite apart from my drug problems (acid freak, not crackhead --
there's a difference!) and the gratuitous misreference to GUI
programming (I agree completely! I hate GUIs even more than I hate
command-line programs -- they're just useful, on occasion!) and
the unfortunate failure of my former coauthors -- the swinish
bastards! -- to recognize my contributions to the deep foundations of
every paper on Avida, I have to agree that any statement
recommending, say, Python over Perl, APL, Pascal, or COBOL as a
solution is likely to be at best disingenuous and at worst just plain
wrong. It is well-known that any Turing-complete language
(given infinite memory, yada yada) can emulate any other -- so why
choose between them?
Dunno. But, repetitive as it may be to say it, I think
a large part of the solution to bad scientific programming is to use a
language like Python. Seriously, I'm perfectly aware that Lincoln
Stein (and likely Matthew Garrett) can kick my ass when it comes to a
mano-y-mano, Perl-y-Python scripting contest. I'm even reasonably
confident that Lincoln Stein could take me down in person; he looks
mean. (I haven't met Matthew.) But to cite an N of 1 ("worked for
me!") as an actual argument... well, I'm no math major but it seems
like a large std deviation.
An argument that I might make, were I still slavishly and unreasonably
devoted to Perl rather than to Python, would be to point out that
anyone writing C extensions for Perl by hand without using SWIG
and/or XSAPI probably has bigger problems than over-frequent enjoyment
of a little crack. If that's the big problem with Perl, then it's
not a problem at all.
This argument ignores the value of writing pseudocode instead of line
noise, but that seems to be a personal preference rather than an absolute,
for some reason...
And (seriously) Matt's point that this is a social problem is entirely
correct. Teaching people Python at an earlier age might help there. ;)
...why "system()" sucks.
But let's move on to a different argument: my savage hatred of
"system()". Do an experiment: try
writing a parser for the "generic"
GFF format.
What, you say? That's easy? Sure is -- for each and every one of the
bajillion programs that output GFF, it's easy! Now, let's see which
field(s) they overloaded this time...
The problem, to put it bluntly, is formats. In information theoretic
terms, stdout is often a very lossy channel, and it is difficult (and
often impossible) to make it 100% clean. Why? Well, suppose someone
gives you some brilliantly written (and novel) standalone piece of
code, and it takes in sequences in FASTA format together with a couple
of parameters. Now the program does some fantastically complex set of
calculations -- gene finding, HMM search, Gibbs sampling, sequence
alignment -- and spits out some text as a result. That's right --
some text. What does the text mean? At this point the hapless user
of a novel program has several options. S/he can:
1. write a one-off parser that grabs the necessary data and runs.
2. write a complete parser that parses all of the output and puts it
into a nice structure for later use.
3. hope like hell that the author of the program provided a "standard"
format like GFF that captures some significant component of the output.
4. wait for someone more anal retentive (or needier, or smarter, or
harder-working) to write a really good parser for the format.
Libraries like BioPerl or BioPython give you #3 and #4 (with time).
#2 takes a lot of effort and is only worth it when you really need
all of the info in the output. #1 is what everybody does, in practice,
right up until it bites 'em in the butt.
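To make #1 concrete, here's a minimal sketch (in Python, naturally) of the
kind of one-off parser everybody writes. It assumes the usual nine
tab-separated GFF columns and punts entirely on the attributes field, since
that's the part everybody overloads anyway; the function and variable names
are just mine, not anybody's actual API.

    import sys

    def parse_gff(lines):
        """Yield (seqname, feature, start, end, attributes) from GFF-ish lines."""
        for line in lines:
            line = line.strip()
            if not line or line.startswith('#'):
                continue                  # skip blanks and comments
            fields = line.split('\t')
            if len(fields) < 9:
                continue                  # one-off style: silently drop weird lines
            seqname, source, feature, start, end, score, strand, frame, attrs = fields[:9]
            yield seqname, feature, int(start), int(end), attrs

    if __name__ == '__main__':
        for record in parse_gff(open(sys.argv[1])):
            print(record)

The moment somebody's output sticks something clever in column 9, this is
exactly the script that bites you in the butt.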
There's one huuuuuge problem with all of this, however: you're at the mercy
of the author of the package to provide full, honest information in the
output. Well, good luck with that, and have a good time rewriting your
parser when Joe Package Author decides that semicolons are a better
divider than commas...
It should be obvious that the best solutions above (#2/#4) can only
ever be as good as a good embedding of the package in your SLOC
(Scripting Language Of Choice). And, far too often, the actual
parsing solution isn't that good, and can't be extended without
breaking everybody else's parsers. That's why command-line
executables with no associated library or embedding will, to a general
and somewhat loose approximation, always suck.
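To spell out what I mean by "embedding", here's the contrast in miniature.
The tool and module names below are made up; the point is where the
structure lives.

    import os

    # The system()/popen way: the "API" is whatever text the author printed,
    # in whatever format they liked this release.
    output = os.popen('some_tool input.fa').read()
    hits = [line.split('\t') for line in output.splitlines() if line.strip()]

    # The embedded way: the result is a real data structure, and the
    # interface is a function signature that can be documented and versioned.
    # import some_module
    # hits = some_module.search('input.fa')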
So, people: use Python. Or COBOL. And write library functions
loosely wrapped in main()s, not deeply embedded spaghetti code.
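And in case "library functions loosely wrapped in main()s" sounds abstract,
the shape I mean is roughly this (the function names are invented for the
example):

    import sys

    def find_orfs(sequence, min_length=300):
        """The importable part: does the real work, returns real data.

        (Stub -- the point is the shape, not the algorithm.)
        """
        return []   # list of (start, end) tuples

    def main(args):
        # The thin wrapper: argument handling and output formatting only.
        sequence = open(args[0]).read()
        for start, end in find_orfs(sequence):
            print('%d\t%d' % (start, end))

    if __name__ == '__main__':
        main(sys.argv[1:])

Anybody who wants your results as data can import the module and call
find_orfs() directly; anybody who wants a command-line tool still gets one
for free.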
--titus
The shoutout today goes to
gnutizen,
who obviously has his own drug issues; he certified me as "Journeyer"!
p.s. It turns out I was a math major. Huh. Weird.
p.p.s. If someone
with some Perl and C/C++ knowledge were to go comment on my SWIG/Perl
embedding of motility
(see the CVS), it would be most useful to me. Just a thought.
p.p.p.s. In the bioinformatics language wars, I have to say that Bioconductor really takes the cake in the "absurdity" category. I personally like R, but why someone would choose it over a more mainstream language for general-purpose programming <shakes head>...