This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Naming novel molecules
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
There you are, having fun with your NMR machine, determining
structures left and right, using CAS to get a compound id, and seeing
what the world knows about it. You've set up a local database around
the CAS# which tracks how you made a compound and any interesting
information about it.
One day though, you're playing around with your plasma vapor
deposition equipment (remember, you're a physicist; chemical reagents
are icky and dangerous) and your mass spec gets a new signal. You
isolate it, get the structure, and, ... well look at that, CAS doesn't
know about it. You've created a previously unknown molecule! (Or
previously unpublished. For all you know, some big pharma could apply
for a patent on it tomorrow.)
Now you've got a problem. How do you stick it in your database? It
uses the CAS# as the primary key, and that's the flaw. If you had
taken the database class back in school instead of field theory, you
would have learned that your primary keys shouldn't depend on someone
else. (Well, unless you have some very good certainty about its
appropriateness, and even then you should be wary.)
A solution is to provide your own naming service. You can use CAS for
the compounds they've identified and your own unique identifiers for
those not in CAS, but someday your compounds might be in CAS so you'll
have a duplicate name. So the best approach is to give each unique
compound your own identifier, and include an optional CAS# for those
which are in CAS.
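
As a minimal sketch of that layout, assuming SQLite and a made-up
table called "compound" (the table name, column names, and the
placeholder CAS# are all illustrative, not from any real system):

```python
import sqlite3

# Hypothetical schema: your own key is primary, the CAS# is optional.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE compound (
        local_id   INTEGER PRIMARY KEY,  -- your own identifier
        cas_number TEXT UNIQUE,          -- optional; NULL when CAS has no entry
        notes      TEXT
    )
""")

# A compound CAS already knows about (7732-18-5 is water).
conn.execute("INSERT INTO compound (cas_number, notes) VALUES (?, ?)",
             ("7732-18-5", "water"))

# Two novel compounds: no CAS# yet, but each still gets a primary key.
# SQLite's UNIQUE constraint allows any number of NULLs.
novel_id = conn.execute("INSERT INTO compound (notes) VALUES (?)",
                        ("novel compound, not in CAS",)).lastrowid
other_id = conn.execute("INSERT INTO compound (notes) VALUES (?)",
                        ("another novel compound",)).lastrowid

# If CAS registers the compound later, fill in the number. The primary
# key does not change. "0000-00-0" is a placeholder, not a real CAS#.
conn.execute("UPDATE compound SET cas_number = ? WHERE local_id = ?",
             ("0000-00-0", novel_id))
```

Nothing that refers to a compound by local_id has to change when a
CAS# shows up, which is the point of not depending on someone else's
key.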
(There are well-known problems you should be aware of in your quest to
put molecular information in a database. The structure could have
been misidentified, so there needs to be a way to handle corrections.
The compound may be in one of several tautomeric forms, or described
in one of a couple different ways for handling stereochemistry. You
can rediscover them from scratch if you want, or pay experienced
people to help you out. The process of getting information about a
compound into a standard form for database entry is called
registration.)
(You might think you could just publish the compound and get a CAS#
for it, but I think you need to characterize more about the compound
than just the structure. Even if that's not the case, the
combinatorial chemists add another complication. They start with a
core template and have ways to stick almost any side group off one or
more of the atoms in that core. They can easily make any one of an
essentially unlimited number of compounds. CAS won't give them an
infinite number of identifiers, so how do you ask them to make a
specific compound for you? And of course if you're trying to make the
next blockbuster drug, you don't want to publish it until after you've
applied for a patent.)
You fixed that problem in your database. All your compounds have a
unique name you assigned to them. You had to implement your own graph
isomorphism search program to ensure that the new compounds were
unique, but that was a fun bit of programming. Then one evening
you're out salsa dancing and meet a physical chemist who is studying
the chemistry of plasma vapor deposition. You're curious to know if
she knows about the novel compound you made, so she asks you to email
her the structure. (Score! You now know her email address. BTW,
since you're a physicist you're almost certainly a male.)
What do you send her? There's no CAS#, and she doesn't know about
your identifiers. You could send the IUPAC name, unless it's one of
those compounds which can't be named under IUPAC rules. Or you could
send the chemical graph, either as an image (using the visual
depiction language of chemists) or as some standard graph data
structure (listing all the atoms and bonds and their types, charges,
chirality, etc.). The last of these is the most common because it
always works and it means the receiver doesn't need to sketch the
structure back into the computer. The most popular of the connection
table ("CT") formats is the SD file or molfile, from MDL. You register
with their free download service, get the file format definition, and
send her your compound's structure.
This works, but something feels wrong. The molecular graph
unambiguously describes a compound. Why can't it be used as a unique
identifier in its own right? Well, besides that the order of atoms
and bonds is arbitrary. And besides that MDL's connection table is
verbose and takes about 60 bytes per atom and 15 bytes per bond.
(That can easily be shrunk; their CT stores coordinates, which aren't
needed for a graph.)
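
To see why the arbitrary atom order gets in the way, here is a small
sketch. The data layout (a list of element symbols plus a list of
(atom, atom, bond order) triples) is an assumption made for this
example, not the MDL format, and the brute-force comparison is only
practical for a handful of atoms.

```python
from itertools import permutations

def same_graph(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force isomorphism test; only practical for tiny graphs."""
    if len(atoms_a) != len(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    target = {(frozenset((i, j)), order) for i, j, order in bonds_b}
    n = len(atoms_a)
    for perm in permutations(range(n)):
        # perm[i] is the atom in B proposed as the match for atom i in A.
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {(frozenset((perm[i], perm[j])), order)
                  for i, j, order in bonds_a}
        if mapped == target:
            return True
    return False

# Ethanol, heavy atoms only (C-C-O), written down in two atom orders.
atoms_1, bonds_1 = ["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]
atoms_2, bonds_2 = ["O", "C", "C"], [(0, 1, 1), (1, 2, 1)]
# Dimethyl ether (C-O-C): same atoms, different connectivity.
atoms_3, bonds_3 = ["C", "O", "C"], [(0, 1, 1), (1, 2, 1)]

print((atoms_1, bonds_1) == (atoms_2, bonds_2))         # False: the listings differ
print(same_graph(atoms_1, bonds_1, atoms_2, bonds_2))   # True: same molecule
print(same_graph(atoms_1, bonds_1, atoms_3, bonds_3))   # False: different molecule
```

The two ethanol listings describe the same compound but do not compare
equal as text or data, so a raw connection table cannot serve as a
lookup key without a graph comparison against every stored entry.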
If only there were some way to represent any chemical graph as a single
"word" such that 1) it could be stored on a line in a file and easily
imported into a cell of Excel, 2) all isomorphic graphs are mapped to
the same word, and 3) the word is unique, so that no non-isomorphic graphs
are mapped to the same word. Which takes you back to what you wanted
originally -- a unique, unambiguous name for every compound.
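
The three requirements can be met, at least in principle, by a
brute-force sketch: write the graph out as a string for every possible
atom ordering and keep the smallest one. The function name and the
string layout below are invented for illustration and are not any
standard notation; the factorial cost makes this usable only for very
small graphs.

```python
from itertools import permutations

def canonical_word(atoms, bonds):
    """Smallest string encoding of the graph over all atom orderings.

    atoms: list of element symbols
    bonds: list of (atom index, atom index, bond order) triples
    """
    best = None
    for order in permutations(range(len(atoms))):
        # order[k] is the old index of the atom placed at new position k.
        new_index = {old: new for new, old in enumerate(order)}
        atom_part = ".".join(atoms[old] for old in order)
        renumbered = sorted(
            tuple(sorted((new_index[i], new_index[j]))) + (bond_order,)
            for i, j, bond_order in bonds
        )
        bond_part = ".".join(f"{i}-{j}x{b}" for i, j, b in renumbered)
        word = atom_part + "/" + bond_part
        if best is None or word < best:
            best = word
    return best

# Ethanol heavy atoms (C-C-O) in two atom orders, and dimethyl ether (C-O-C).
ethanol_a = canonical_word(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = canonical_word(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
ether = canonical_word(["C", "O", "C"], [(0, 1, 1), (1, 2, 1)])

print(ethanol_a == ethanol_b)  # True: isomorphic graphs share one word
print(ethanol_a == ether)      # False: different graphs get different words
```

The word contains no whitespace, so it fits on one line or in one
spreadsheet cell. Isomorphic graphs generate the same set of candidate
strings and therefore the same minimum, and because each string spells
out every atom and bond, two different graphs can never share a word.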