The Artima Developer Community

Python Buzz Forum
SMILES tokens

Andrew Dalke

Andrew Dalke is a consultant and software developer in computational chemistry and biology.
SMILES tokens Posted: Apr 12, 2004 5:16 AM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: SMILES tokens
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.

Okay, so you're a hard-core geek and you want to know how to write your own chemical informatics toolkit, so that you too will get fun, fame, fortune, and fem----... err, MOTAS. As far as I know, there is no documentation or textbook on this topic. The chemical informatics books I've seen all talk about the science of how the fields work, not the software, and of course almost no computer science book goes into chemistry, other than some mention that graph theory can be applied to the valence bond model of molecules. Let me be your trusty guide through these uncharted electrons.

I'll start with SMILES since it's such a pretty, well-defined nomenclature. Eventually I'll describe SD and perhaps mol2 files, and a bit of the agony that is the PDB. I'll assume you know what SMILES is. If not, Daylight has plenty of documentation.

If you have a compound which falls into Daylight's model of chemistry (i.e., covalent bonds, in a ground state, etc.) then it can be represented as a SMILES string, or alternatively you can say that it's represented in SMILES. A SMILES string can be broken down into smaller terms: atom, bond, open branch, close branch, ring closure, and dot disconnect. To make the enumeration easier to understand, I'll separate the atom term into "element" and "atom", where the first is an atom in the organic subset and may be written without square brackets.

  • element: one of 'c', 'n', 'o', 's', 'p', 'B', 'C', 'N', 'O', 'F', 'P', 'S', 'I', 'Cl', or 'Br'.
  • atom: of the form '[' mass? symbol chiral? hcount? charge? ']' where the '?' means the given component is optional and where
    • mass: a non-negative integer
    • symbol: one of '*', 'H', 'He', 'Li', 'Be', .... the element symbols
    • chiral: one of the chiral symbols (which I won't list right now)
    • hcount: 'H' followed by an optional non-negative integer (as in '[NH4+]')
    • charge: the atomic charge, written as '+' followed by either a non-negative integer or zero or more '+'s, or a '-' followed by either a non-negative integer or zero or more '-'s.
  • bond: one of '=', '#', '/', '\', ':', '~', or '-'.
  • open branch: the character '('
  • close branch: the character ')'
  • ring closure: either a single digit (including 0, but don't use a 0 when generating a SMILES string) or '%' followed by two digits (and note that %09 and 9 represent the same ring closure).
  • dot disconnect: the character '.'
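
The list above maps fairly directly onto a regular expression with one alternative per term. Here is a minimal sketch in Python, not a definitive implementation. The names TOKEN_PATTERN, tokenize, and ring_closure_number are made up for this example, and the contents of a bracket atom are not validated; anything between '[' and ']' is accepted as a single atom token.

```python
import re

# One alternative per term in the list above. Order matters inside the
# element alternative: 'Cl' and 'Br' must be tried before 'C' and 'B'.
TOKEN_PATTERN = re.compile(r"""
    (?P<atom>\[[^\]]*\])                  # bracket atom, contents unchecked
  | (?P<element>Cl|Br|[cnospBCNOFPSI])    # organic subset, no brackets
  | (?P<bond>[=#/\\:~-])
  | (?P<open_branch>\()
  | (?P<close_branch>\))
  | (?P<ring_closure>\d|%\d\d)
  | (?P<dot_disconnect>\.)
""", re.VERBOSE)


def tokenize(smiles):
    """Return a list of (kind, text) pairs, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        m = TOKEN_PATTERN.match(smiles, pos)
        if m is None:
            raise ValueError(
                f"unrecognized character {smiles[pos]!r} at position {pos}")
        # lastgroup is the name of the alternative that matched
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def ring_closure_number(text):
    """Map a ring closure token to its number, so '%09' and '9' agree."""
    return int(text.lstrip("%"))
```

Calling tokenize("ClCBr") gives three element tokens, 'Cl', 'C', and 'Br', which is why the two-letter symbols come first in the pattern.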

What I've done here is break a SMILES string into its smallest parts. In linguistics these parts are called morphemes (or is that lexemes?). In computer science these are called tokens. Here are some SMILES strings and their tokens.

CCO
  element: 'C'
  element: 'C'
  element: 'O'

CC(=O)O
  element: 'C'
  element: 'C'
  open branch: '('
  bond: '='
  element: 'O'
  close branch: ')'
  element: 'O'

[Na+].[Cl-]
  atom: '['
    symbol: 'Na'
    charge: '+'
    ']'
  dot disconnect: '.'
  atom: '['
    symbol: 'Cl'
    charge: '-'
    ']'

[235U]
  atom: '['
    mass: '235'
    symbol: 'U'
    ']'
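
The nested breakdown of a bracket atom can be sketched with a second regular expression. This is only an illustration under a few assumptions: the names BRACKET_ATOM, parse_bracket_atom, and charge_value are invented here, the symbol is not checked against the periodic table (any capital letter with an optional lowercase letter, a one- or two-letter lowercase symbol, or '*' is accepted), only '@' and '@@' are accepted as chiral symbols, and the hydrogen count is taken in the usual Daylight form of 'H' followed by optional digits.

```python
import re

# '[' mass? symbol chiral? hcount? charge? ']'
BRACKET_ATOM = re.compile(r"""
    \[
    (?P<mass>\d+)?
    (?P<symbol>\*|[A-Z][a-z]?|[a-z]{1,2})    # assumption: not checked further
    (?P<chiral>@@?)?                         # assumption: only '@' and '@@'
    (?P<hcount>H\d*)?
    (?P<charge>\+(?:\d+|\+*)|-(?:\d+|-*))?
    \]
""", re.VERBOSE)


def parse_bracket_atom(text):
    """Split a bracket atom into the components that are present."""
    m = BRACKET_ATOM.fullmatch(text)
    if m is None:
        raise ValueError(f"not a valid bracket atom: {text!r}")
    return {name: value
            for name, value in m.groupdict().items()
            if value is not None}


def charge_value(charge):
    """Convert a charge string such as '+', '--', or '+2' to an integer."""
    sign = 1 if charge[0] == "+" else -1
    rest = charge[1:]
    if rest.isdigit():
        return sign * int(rest)
    # Otherwise the charge is a run of identical signs; count them.
    return sign * len(charge)
```

For the examples above, parse_bracket_atom("[Na+]") returns the symbol 'Na' and the charge '+', and parse_bracket_atom("[235U]") returns the mass '235' and the symbol 'U'.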

Tokens are not randomly placed in a SMILES string. There is a pattern to how these tokens are arranged; an atom can follow another atom but a bond cannot follow another bond, and a close branch cannot follow an open branch. With some work it's possible to build a table listing which terms can and cannot follow another.

Each row is a token, each column is the token that follows it, and each cell gives an example SMILES containing that pair, or "no" if the pair isn't allowed.

previous token   atom     bond      open branch     close branch   ring closure   dot disconnect
start            C        no        no (see below)  no             no             no (see below)
atom             CC       C=C       C(=O)[O-]       C(=O)[O-]      c1ccccc1       C.C
bond             C=C      no        no              no             C=1CCC=1       no
open branch      C(C)     C(=C)     no              no             no             C(.C)
close branch     C(C)C    C(C)=C    C(C)(C)C        C(C(C))C       C(C)1ONON1     C(C).C
ring closure     C1CCC1   C1=CCC1   C1(CC)CC1       C1C(CC1)       C12CC1C2       c1ccccc1.C
dot disconnect   C.C      no        no              no             no             no (see below)

Notes:

  • Daylight's old documentation allows (OC)C as a valid SMILES but their implementation did not handle it. The new documentation (changed in late 2003, I think) doesn't mention that case. OpenEye does handle it, except that when I tested it in Sept. 2003 it didn't seem to handle chirality correctly with cases like (C/F)=C/F. I reported it but haven't tested it with a newer version of OEChem. In any case, no program should ever generate a SMILES like that (since Daylight will call it an error) so I recommend not handling it.
  • Daylight doesn't accept anything which starts with a dot disconnect (as ".C") nor anything with two disconnects in a row (as "C..C"). OpenEye does accept them under its permissive parser but not its strict one. For now I'll be strict and say that they aren't allowed.
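
The table and the strict reading of the notes can be written down as a lookup from each token kind to the set of kinds allowed to follow it. The following is a rough sketch rather than a finished validator. TOKEN, ALLOWED_NEXT, and check_adjacency are names invented for this example, "element" and "atom" are merged into a single "atom" kind as in the table, and nothing is said about which token may end a string because the table doesn't cover that.

```python
import re

TOKEN = re.compile(r"""
    (?P<atom>\[[^\]]*\]|Cl|Br|[cnospBCNOFPSI])
  | (?P<bond>[=#/\\:~-])
  | (?P<open_branch>\()
  | (?P<close_branch>\))
  | (?P<ring_closure>\d|%\d\d)
  | (?P<dot_disconnect>\.)
""", re.VERBOSE)

EVERYTHING = {"atom", "bond", "open_branch", "close_branch",
              "ring_closure", "dot_disconnect"}

# ALLOWED_NEXT[previous] is the set of kinds that may follow 'previous',
# transcribed row by row from the table (strict interpretation).
ALLOWED_NEXT = {
    "start": {"atom"},
    "atom": EVERYTHING,
    "bond": {"atom", "ring_closure"},
    "open_branch": {"atom", "bond", "dot_disconnect"},
    "close_branch": EVERYTHING,
    "ring_closure": EVERYTHING,
    "dot_disconnect": {"atom"},
}


def check_adjacency(smiles):
    """Return None if every adjacent pair is allowed, else a message."""
    previous = "start"
    pos = 0
    while pos < len(smiles):
        m = TOKEN.match(smiles, pos)
        if m is None:
            return f"unrecognized character {smiles[pos]!r} at position {pos}"
        kind = m.lastgroup
        if kind not in ALLOWED_NEXT[previous]:
            return f"{kind} cannot follow {previous} at position {pos}"
        previous = kind
        pos = m.end()
    return None
```

With this table check_adjacency("CC(=O)O") returns None, while "C==C", "C()C", ".C", "C..C", and "(OC)C" each produce a message.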

Even this isn't enough to describe all valid SMILES strings. For example, "C)O" is allowed with the given rules, even though it obviously makes no real sense, and "ccC", while also legal, makes no chemical sense because the aromatic carbons must be in a ring. Still, there's a lot that can be done with just simple tokenization. The next step is to parse the token stream.
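
To see why pairwise rules can't catch "C)O", it helps to note that matching branches needs a count carried across the whole string. The sketch below is just an illustration of that one point, not the parsing step itself, and the function name branches_balanced is made up. It scans characters directly, which works because '(' and ')' don't appear in any of the other terms listed earlier.

```python
def branches_balanced(smiles):
    """Return True if every ')' closes an earlier '(' and none stay open."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A close branch with no matching open branch, as in "C)O".
                return False
    return depth == 0
```

Here branches_balanced("CC(=O)O") is True, while both "C)O" and "C(C" give False.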
