This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Chemical Informatics Toolkits
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Now that you know some of the history of chemical nomenclature, how do
you actually use one of these molecular representations to do science?
The easiest way is to get existing software which does what you want.
For example, you can buy an integrated set of tools from Tripos, MDL,
or Accelrys. They can be used for many of the tasks needed for
chemical research, and they are extensible and scriptable so that new
capabilities can be added by customers.
There are limits to the extensibility. These programs were designed
as applications, with the assumption that the application will always
be in charge. Suppose though that you want to write an Excel add-in
which uses parts of Sybyl to compute molecular properties. That can't
be done by a customer because Sybyl can't be used as a library that
way. (Or at least not that I know of.) You're somewhat stuck, and
your add-in will need to use a workaround like starting Sybyl without
a GUI and passing it an SPL script.
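The workaround above can be sketched roughly as follows. This is only an illustration of the "run the application in batch mode and hand it a script" pattern, not Sybyl's actual interface: the post does not give the real command line, so the launcher command is supplied by the caller, and the function name run_batch_script and its parameters are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_batch_script(command, script_text, suffix=".spl", timeout=60):
    """Run an external application on a temporary script and return its stdout.

    `command` is the caller-supplied list that launches the application
    without a GUI (hypothetical here; the real Sybyl invocation is not
    known). The path of the temporary script is appended as the last argument.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # Write the script text to a file the application can read.
        script_path = Path(tmpdir) / ("job" + suffix)
        script_path.write_text(script_text)
        # check=True raises CalledProcessError if the application fails.
        result = subprocess.run(
            list(command) + [str(script_path)],
            capture_output=True,
            text=True,
            timeout=timeout,
            check=True,
        )
    return result.stdout


if __name__ == "__main__":
    # Stand-in demo: the Python interpreter plays the role of the
    # scriptable application, since no chemistry package is assumed.
    print(run_batch_script([sys.executable], "print(6 * 7)", suffix=".py"))
```

The cost of this approach is visible in the sketch: every call pays for starting a process, and results come back as text that the add-in has to parse, which is the kind of limitation a toolkit used as a library avoids.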
If you are writing a new application or plug-in, need direct
access to the molecule's data structure, or are writing an algorithm which
otherwise exceeds the limitations of these applications (or if you're
a programming geek who prefers a "real" programming language), then
you'll need to look at the available toolkits. These are software
libraries which are used as part of a larger system instead of the
other way around. You can buy some from a commercial vendor, like
Daylight (and
PyDaylight,
my Python API to it) or
OpenEye, or get one of the
open source variants, like
Frowns,
Open Babel, or
JOELib.
As an interesting note, chemical informatics is a small field and
these are all interrelated projects. The main parent is Daylight,
which started in the 1980s with the family Weininger at Pomona
College. I started PyDaylight at Bioreason in 1998. Brian Kelley
worked with me on it. He wrote Frowns after leaving Bioreason and
based its API on PyDaylight. Matt Stahl and Pat Walters wrote Babel
in the mid-1990s at the University of Arizona as a molecular converter
program. They used that code for various projects, and when Matt
started at OpenEye in 1999 (or late 1998?) they decided that OpenEye
would develop an open source version based on Babel, which became
OELib. After a few years, they decided that a rewrite was
in order and that the new version, called OEChem, would be closed
source. OELib got picked up by others and turned
into Open Babel. JOELib is a Java version of OELib, modeled on
that library.
So where are the ties between the Daylight and Babel derived threads?
Daylight, OpenEye, and Bioreason are all located in Santa Fe, NM.
Dave Weininger over at Daylight encouraged Anthony Nicholls to start
OpenEye, and Anthony liked the Santa Fe area. Part of the
encouragement was, I think, to show that a company could make a living
from selling a set of chemistry-oriented toolkits. Roger Sayle (of
RasMol fame) was a VP at Metaphorics, a sister company to Daylight. C
programmer that he is, he also helped out a lot with the Daylight
toolkits and provided some algorithm suggestions for OELib, and I
think he contributed some SMILES and SMARTS parsing code he was using
for some of his own projects. (Given the different edge cases in the
two code bases, I know Daylight and OELib use different parsers
:)). In mid-2000 he decided to work down the street, as it
were, with OpenEye. In addition, Pat at Vertex used the Daylight
toolkit for projects while also contributing code and support to
OpenEye. And there's me, who uses both toolkits and submits obscure
bug reports, to the combined thankfulness and annoyance of both
companies.
Are there others? Certainly, although I don't know how accessible
they are as toolkits. CACTVS must have some chemical informatics
libraries and I know people have paid Wolf-D. Ihlenfeldt for some of
the components, but I don't think it has a low-level API for all those
libraries. Pipeline Pilot has SMILES and SMARTS capabilities, but they are
accessed through SOAP server calls rather than through a more
traditional library interface.
In addition to the publicly known projects, many companies have
in-house projects. In my research for the essays on nomenclature
history I read that many of the large pharmas wrote their own systems
in the mid-1900s and I've heard that some of those still exist. I
consulted with Combichem and helped them with their in-house chemical
informatics toolkit. Combichem was bought first by DuPont Pharma, then
eventually resold to Deltagen, which proceeded to go under. People
who left Combichem convinced their new employers to buy a copy of that
codebase.