This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: SMILES tokens
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Okay, so you're a hard-core geek and you want to know how to write your
own chemical informatics toolkit, so that you too will get fun, fame,
fortune, and fem----... err, MOTAS. As far as I know, there is no
documentation or textbook on this topic. The chemical informatics
books I've seen all talk about the science of how the fields work, not
the software, and of course almost no computer science book goes into
chemistry, other than some mention that graph theory can be applied to
the valence bond model of molecules. Let me be your trusty guide
through these uncharted electrons.
I'll start with SMILES since it's such a pretty, well-defined
nomenclature. Eventually I'll describe SD and perhaps mol2 files, and
a bit of the agony that is the PDB. I'll assume you know what SMILES
is. If not, Daylight has plenty
of documentation.
If you have a compound which falls into Daylight's model of chemistry
(i.e., covalent bonds, in a ground state, etc.) then it can be
represented as a SMILES string, or alternatively you can say
that it's represented in SMILES. A SMILES string can be broken
down into smaller terms: atom, bond, open branch, close branch, ring
closure, and dot disconnect. To make the enumeration easier to
understand, I'll separate the atom term into "element" and "atom",
where the first is an atom in the organic subset and may be written
without square brackets.
element: one of 'c', 'n', 'o', 's', 'p', 'B', 'C', 'N', 'O', 'F',
    'P', 'S', 'I', 'Cl', or 'Br'.
atom: of the form '[' mass? symbol chiral? hcount? charge? ']'
    where the '?' means the given component is optional and where
    mass: a non-negative integer
    symbol: one of '*', 'H', 'He', 'Li', 'Be', ... (the element symbols)
    chiral: one of the chiral symbols (which I won't list right now)
    hcount: an 'H' optionally followed by a non-negative integer
    charge: the atomic charge, written as a '+' followed by either a
        non-negative integer or zero or more '+'s, or a '-' followed by
        either a non-negative integer or zero or more '-'s.
bond: one of '=', '#', '/', '\', ':', '~', or '-'.
open branch: the character '('
close branch: the character ')'
ring closure: either a single digit (including 0, but don't use a
    0 when generating a SMILES string) or '%' followed by two digits (and
    note that %09 and 9 represent the same ring closure).
dot disconnect: the character '.'
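To make the bracketed atom term a bit more concrete, here is a minimal sketch of it as a Python regular expression. The names ATOM_PATTERN, parse_atom and charge_value are made up for illustration. It assumes the hydrogen count is written as an 'H' followed by optional digits (as in [NH4+]), it only knows the '@' and '@@' chiral symbols, and its symbol pattern accepts anything shaped like an element symbol instead of checking a real list of elements.

```python
import re

# Hypothetical pattern for: '[' mass? symbol chiral? hcount? charge? ']'
ATOM_PATTERN = re.compile(r"""
    \[
    (?P<mass>\d+)?                            # optional non-negative integer
    (?P<symbol>\*|[A-Z][a-z]?|[cnosp])        # assumption: shape check only
    (?P<chiral>@@?)?                          # assumption: only '@' and '@@'
    (?P<hcount>H\d*)?                         # assumption: 'H' plus optional count
    (?P<charge>\+(?:\d+|\+*)|-(?:\d+|-*))?    # '+', '++', '+2', '-', '--', '-1', ...
    \]
""", re.VERBOSE)


def charge_value(text):
    """Turn a charge string such as '+', '++', '-2' (or None) into an int."""
    if text is None:
        return 0
    sign = 1 if text[0] == "+" else -1
    digits = text[1:]
    if digits.isdigit():
        return sign * int(digits)
    # No digits: the charge is the number of repeated sign characters.
    return sign * len(text)


def parse_atom(token):
    """Return the components of a bracketed atom, or None if it doesn't match."""
    m = ATOM_PATTERN.fullmatch(token)
    if m is None:
        return None
    hcount = m.group("hcount")
    return {
        "mass": int(m.group("mass")) if m.group("mass") else None,
        "symbol": m.group("symbol"),
        "chiral": m.group("chiral"),
        # A bare 'H' means one hydrogen; a missing hcount means none.
        "hcount": 0 if hcount is None else int(hcount[1:] or 1),
        "charge": charge_value(m.group("charge")),
    }


print(parse_atom("[13CH4]"))
print(parse_atom("[O-]"))
```

With this sketch, parse_atom("[13CH4]") gives a mass of 13, the symbol 'C', an hcount of 4 and a charge of 0, while something written without square brackets, like "C", doesn't match at all and returns None.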
What I've done here is break a SMILES string into its smallest parts.
In linguistics these parts are called morphemes (or is that
lexemes?). In computer science these are called tokens.
Here are some SMILES strings and their tokens.
CCO
    element: 'C'
    element: 'C'
    element: 'O'

CC(=O)O
    element: 'C'
    element: 'C'
    open branch: '('
    bond: '='
    element: 'O'
    close branch: ')'
    element: 'O'
Tokens are not randomly placed in a SMILES string. There is a pattern
to how these tokens are arranged; an atom can follow another atom but
a bond cannot follow another bond, and a close branch cannot follow an
open branch. With some work it's possible to build a table listing
which terms can and cannot follow another.
Rows give the preceding term, columns give the term that follows. An
allowed pair is shown with an example; "no" means the pair is not allowed.

                atom    bond     open branch     close branch  ring closure  dot disconnect
start           C       no       no (see below)  no            no            no (see below)
atom            CC      C=C      C(=O)[O-]       C(=O)[O-]     c1ccccc1      C.C
bond            C=C     no       no              no            C=1CCC=1      no
open branch     C(C)    C(=C)    no              no            no            C(.C)
close branch    C(C)C   C(C)=C   C(C)(C)C        C(C(C))C      C(C)1ONON1    C(C).C
ring closure    C1CCC1  C1=CCC1  C1(CC)CC1       C1C(CC)1      C12CC1C2      c1ccccc1.C
dot disconnect  C.C     no       no              no            no            no (see below)
Notes:
Daylight's old documentation allows (OC)C as a valid SMILES but
their implementation did not handle it. The new documentation
(changed in late 2003, I think) doesn't mention that case. OpenEye
does handle it, except that when I tested it in Sept. 2003 it didn't
seem to handle chirality correctly with cases like (C/F)=C/F. I
reported it but haven't tested it with a newer version of OEChem. In
any case, no program should ever generate a SMILES like that (since
Daylight will call it an error) so I recommend not handling it.
Daylight doesn't accept anything which starts with a dot
disconnect (as ".C") nor anything with two disconnects in a row
(as "C..C"). OpenEye does accept them under its permissive
parser but not its strict one. For now I'll be strict and say
that they aren't allowed.
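The table, read strictly, fits nicely into a dictionary. What follows is only a sketch: ALLOWED_NEXT and first_violation are hypothetical names, the "no (see below)" cells are all treated as "no", and "atom" stands for both the bracketed atom and the organic-subset element. The input is a list of term names, one per token.

```python
ALL_TERMS = {"atom", "bond", "open branch", "close branch",
             "ring closure", "dot disconnect"}

# Key: the preceding term. Value: the terms allowed to come next.
ALLOWED_NEXT = {
    "start":          {"atom"},
    "atom":           ALL_TERMS,
    "bond":           {"atom", "ring closure"},
    "open branch":    {"atom", "bond", "dot disconnect"},
    "close branch":   ALL_TERMS,
    "ring closure":   ALL_TERMS,
    "dot disconnect": {"atom"},
}


def first_violation(terms):
    """Return the index of the first term that may not follow its
    predecessor, or None if every adjacent pair is allowed."""
    previous = "start"
    for index, term in enumerate(terms):
        if term not in ALLOWED_NEXT[previous]:
            return index
        previous = term
    return None


# C=C: atom, bond, atom
print(first_violation(["atom", "bond", "atom"]))
# C..C: atom, dot disconnect, dot disconnect, atom
print(first_violation(["atom", "dot disconnect", "dot disconnect", "atom"]))
```

The first call finds nothing wrong and returns None. The second returns 2, the position of the second dot disconnect. This only looks at adjacent pairs, which is exactly as much as the table knows.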
Even this isn't enough to describe all valid SMILES strings. For
example, "C)O" is allowed with the given rules, even though it
obviously makes no real sense, and "ccC", while also legal, makes no
chemical sense because the aromatic carbons must be in a ring. Still,
there's a lot that can be done with just simple tokenization. The
next step is to parse the token stream.
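As a small taste of why pairwise rules fall short, here is a sketch of a check that needs memory of what came earlier: counting open and close branches. The name branches_balanced is made up for illustration. It scans characters directly, on the assumption that '(' and ')' only ever appear as branch tokens and never inside a bracketed atom.

```python
def branches_balanced(smiles):
    """Return True if every close branch has an earlier open branch
    and every open branch is eventually closed."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A close branch with nothing open, as in "C)O".
                return False
    return depth == 0


print(branches_balanced("CC(=O)O"))
print(branches_balanced("C)O"))
```

This prints True for "CC(=O)O" and False for "C)O". No table of "what may follow what" can express that, because the answer depends on how many branches are currently open.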