One of the reasons I wrote
PyRSS2Gen was to experiment
with RSS for data collection in bioinformatics. Last year I
came across
PubCrawler,
which periodically searches PubMed and GenBank and emails you
a summary of new matches to your searches. It's a nice idea,
in part because managing that data yourself is error-prone.
The trend these days is to make that data available through RSS. With
a good RSS client this should be better than email because it can
accumulate all the entries over time, and trackbacks would let
you make comments about the hits, potentially sharing them with others.
(This is all theory - I haven't used a high-end RSS client.)
During that time I also found a site which offered RSS feeds
for PubMed searches, but I can't find it now. I did come across
HubMed and
my.PubMed which
do have RSS feeds. (I tested both to find one of my papers;
search for "dalke" with a refinement of "tcl". I found HubMed
the easier of the two. It wasn't obvious how to refine a
search in my.PubMed.)
In theory, a lot of searches could have RSS front-ends. What about a
BLAST job run every week, where the RSS feed tells you about the new
matches? (A rough sketch of that idea appears after this paragraph.)
What about an annotation system where you can comment on
regions of a sequence and let others know about it?
(DAS does some of that, but I
would like it to integrate with other non-biology tools. I think
it's close, and something to consider for DAS2.) And how do
PIE's editing features fit into all this?
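To make the weekly-BLAST idea concrete, here's a minimal sketch using
Biopython's NCBIWWW and NCBIXML modules (modern Biopython, not
necessarily what was available at the time): run the search, compare
the hits against the accessions recorded on earlier runs, and report
only the new ones. The bookkeeping file name and the query path are
illustrative assumptions, not part of any existing service.

    import os
    from Bio.Blast import NCBIWWW, NCBIXML

    SEEN_FILE = "seen_accessions.txt"    # hypothetical bookkeeping file
    query = open("query.fasta").read()   # assumes a FASTA query on disk

    # Accessions already reported on earlier runs.
    seen = set()
    if os.path.exists(SEEN_FILE):
        seen = set(open(SEEN_FILE).read().split())

    # Run the search over the network; this can take a few minutes.
    handle = NCBIWWW.qblast("blastp", "nr", query)
    record = NCBIXML.read(handle)

    # Report only the alignments we haven't seen before.
    for alignment in record.alignments:
        if alignment.accession not in seen:
            print(alignment.accession, alignment.title)

    # Remember everything for the next (say, cron-driven) run.
    all_seen = seen | set(a.accession for a in record.alignments)
    open(SEEN_FILE, "w").write("\n".join(all_seen))

The same loop could hand the new hits to an RSS generator instead of
printing them; the feed sketch below shows that half.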
There are a few prerequisites to doing this. The first is a way to
automate PubMed, GenBank, BLAST, and other searches. Biopython, bioperl and the other Bio* projects
all do this to some extent, though I think our EUtils code contributed to Biopython is the most
powerful. The second is support for RSS generation; not a hard task,
but there are still a lot of incorrectly formatted RSS feeds, which is
why we developed PyRSS2Gen. (A sketch wiring the two pieces together
follows this paragraph.)
The third is time and money, since there are too many interesting
things to work on and too many bills to pay. And the last is access
to end-users, which is essential for knowing that what we're doing makes
sense in the real world.
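Here's what the first two prerequisites look like wired together: a
PubMed search turned into an RSS 2.0 feed. This sketch uses
Biopython's modern Bio.Entrez module rather than the original EUtils
package, plus PyRSS2Gen; the contact email and the search term are
placeholders.

    import datetime
    import PyRSS2Gen
    from Bio import Entrez

    Entrez.email = "you@example.com"   # NCBI asks for a contact address

    # Find matching PubMed records and fetch their summaries.
    handle = Entrez.esearch(db="pubmed", term="dalke AND tcl", retmax=20)
    ids = Entrez.read(handle)["IdList"]
    summaries = Entrez.read(Entrez.esummary(db="pubmed", id=",".join(ids)))

    # Turn each summary into an RSS item.
    items = [PyRSS2Gen.RSSItem(
                 title=doc["Title"],
                 link="https://pubmed.ncbi.nlm.nih.gov/%s/" % doc["Id"],
                 guid=PyRSS2Gen.Guid(doc["Id"], isPermaLink=False))
             for doc in summaries]

    # Write out a well-formed RSS 2.0 feed.
    rss = PyRSS2Gen.RSS2(
        title="PubMed search: dalke AND tcl",
        link="https://pubmed.ncbi.nlm.nih.gov/",
        description="New PubMed matches for a saved search",
        lastBuildDate=datetime.datetime.now(),
        items=items)
    rss.write_xml(open("pubmed-search.xml", "w"))

Run that from cron and any RSS client pointed at pubmed-search.xml
picks up the new matches.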
All of our clients the last few years have been chemists, not
biologists. Chemists also do searches, but it's a bit different than
in biology. There isn't anywhere near as much public data for
chemistry as there is for biology. There are ACD and the other
commercial databases, but very few are freely available, and I'm told
that those databases are only a small fraction of the data locked up
in the various pharmas and other chemistry companies. And outside of
conferences it's rare for people to talk to people in other companies
about their research. Even at conferences it's often highly vetted by
the lawyers.
This means most of the data systems are local, with a larger diversity
of servers. Any software must know how to integrate with Thor/Merlin,
Isis, Unity, local Oracle schemas, and whatever else might be hanging
around. Since relatively little new data comes into the company
compared to in-house generated data, it's often easier for one
researcher to talk to the person doing the experiment instead of going
through a computer system. Only in the large pharmas do you start
to approximate the problems that RSS and PubCrawler solve.
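For illustration only, here is one hypothetical shape such integration
code could take: a common search interface with one adapter per
backend. None of the class or method names below come from a real
Thor/Merlin, Isis, or Unity API; they are assumptions for the sketch.

    class SearchBackend:
        """Common interface each local data system adapter implements."""
        def search(self, query):
            raise NotImplementedError

    class OracleBackend(SearchBackend):
        """Adapter for a site-specific Oracle schema (placeholder SQL)."""
        def __init__(self, connection):
            self.connection = connection   # a DB-API connection object
        def search(self, query):
            cursor = self.connection.cursor()
            # Table and column names are placeholders; every site's
            # schema differs.
            cursor.execute(
                "SELECT id, name FROM compounds WHERE name LIKE :q",
                {"q": "%" + query + "%"})
            return cursor.fetchall()

    class MerlinBackend(SearchBackend):
        """Adapter that would speak Thor/Merlin's own protocol."""
        def search(self, query):
            ...   # site-specific; no public API is assumed here

    def search_everywhere(backends, query):
        """Run one query against every configured backend."""
        results = []
        for backend in backends:
            results.extend(backend.search(query))
        return results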
This is another project we would enjoy working on, so if you are
interested in funding us, let us know!