This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Combinitorial Library Generation with SMILES
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
Someone recently asked me how to generate a combinatorial library
given a set of fragments.
For the non-chemist readers, combinatorial chemistry uses a core
structure and reactions that can attach fragments at a given point to
the core. This lets chemists search a structure family to find a
compound that is "better" in a chemistry space with dimensions
including effectiveness, toxicity, digestibility, and ability to reach
the right part of the body. (This is for pharmaceutical chemistry;
combinatorial chemistry can be used for other domains.)
A core may have one, two, or more fragment attachment points, so many
new compounds can be created with this technique. Companies use robots
to generate the new compounds and test them against the target, which
might be a protein or cell. There can be well over 100,000 tests in
an assay. I've worked with a couple of companies to develop tools that
help the scientists better understand these sorts of data sets.
To limit the number of compounds created, many people will generate
virtual libraries and use software to pick the compounds that will be
tested via the robots. If the software were good enough we wouldn't
need the robots. We've a long way to go.
The email asked if any software is available to generate the virtual
libraries. He had been using SMILES strings for the core and
fragments and simply concatenating them together. This doesn't work
in general because it allows at most two attachment points on the
core: one at the front and one at the back of the SMILES string.
The easiest way to do this is with ring closures. Suppose the core
structure is O1CNCCC1 with attachment points on the 3rd
and 5th atoms (the N and the third C). Pick very high ring closure
numbers not seen in real life, like 90 and 91, and add them to the
appropriate atoms. The '%' is needed in SMILES for closure numbers
greater than 9.
The result is O1CN%90CC%91C1.
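The labeling step can also be scripted. Here is a minimal sketch,
assuming the core uses only bracket atoms and the usual organic-subset
atom symbols. The function name label_core and the "position to label"
dictionary are my own invention for illustration, not part of any
toolkit.

```python
import re

# One SMILES atom: a bracket atom, a two-letter halogen, or a
# one-letter organic-subset atom (aliphatic or aromatic).
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def label_core(core, sites):
    """Insert ring closure labels right after selected atoms.

    sites maps a 1-based atom position to a label such as '%90'.
    """
    out = []
    last = 0
    for index, match in enumerate(ATOM.finditer(core), start=1):
        # Copy everything up to and including this atom symbol.
        out.append(core[last:match.end()])
        last = match.end()
        if index in sites:
            out.append(sites[index])
    out.append(core[last:])  # trailing closures, branches, etc.
    return "".join(out)

print(label_core("O1CNCCC1", {3: "%90", 5: "%91"}))  # O1CN%90CC%91C1
```

The label goes directly after the atom symbol, ahead of any ring
closure digits the atom already has, which is still valid SMILES.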
Use the same sort of trick to label the fragments. Suppose a fragment
is OC=CC=C- and the terminal carbon (the "C-") is to be
attached to the nitrogen. The ring closure number for the N is 90 so
label the terminal carbon the same, as OC=CC=C%90. To make it
easier on me, assume a methyl is attached at the core's C attachment
point labeled 91. The corresponding fragment in SMILES is C%91.
To make it all work, concatenate the three strings using the dot
disconnect character. The result is
O1CN%90CC%91C1.OC=CC=C%90.C%91
That's all that's required. When the SMILES parser puts the molecule
together it matches the two %90 and the two %91 ring closures to
stitch the three parts together.
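The concatenation is a one-liner in Python. The sketch below also
counts the '%nn' labels, since a label that doesn't appear exactly
twice means a fragment is missing or mislabeled. The names assemble
and unpaired_labels are hypothetical helpers, and the check assumes
each label is used for just one core-to-fragment bond.

```python
import re
from collections import Counter

def assemble(core, fragments):
    """Join the core and its fragments with the dot disconnect."""
    return ".".join([core, *fragments])

def unpaired_labels(smiles):
    """Return the '%nn' labels that do not occur exactly twice."""
    counts = Counter(re.findall(r"%\d\d", smiles))
    return sorted(label for label, n in counts.items() if n != 2)

product_smiles = assemble("O1CN%90CC%91C1", ["OC=CC=C%90", "C%91"])
print(product_smiles)                   # O1CN%90CC%91C1.OC=CC=C%90.C%91
print(unpaired_labels(product_smiles))  # []
```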
The dot disconnect only says there isn't an implicit bond between the
atoms on either side of it. It doesn't mean that the two atoms can't
be covalently bonded through ring closures or must be parts of
different connected subgraphs. ("Connected subgraphs" is another way
of saying "covalently bonded molecules".)
The same fragment library might be used for two different attachment
points. Because the '%' character only occurs in SMILES before a
two-digit ring closure you can label all your fragment terminals with,
say, "%99" and use simple text substitution as needed for the given
core attachment point.
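Here's a rough sketch of that substitution, combined with enumerating
every fragment choice at every attachment point. The "%99"
placeholder is the example label from the text; relabel and
enumerate_library are names I made up for this illustration.

```python
from itertools import product

PLACEHOLDER = "%99"  # generic label on every fragment terminal

def relabel(fragment, label):
    """Swap the generic placeholder for a specific core label."""
    return fragment.replace(PLACEHOLDER, label)

def enumerate_library(core, site_labels, fragments):
    """Yield one dot-disconnected SMILES per combination of fragments."""
    for combo in product(fragments, repeat=len(site_labels)):
        parts = [core]
        for fragment, label in zip(combo, site_labels):
            parts.append(relabel(fragment, label))
        yield ".".join(parts)

library = list(enumerate_library("O1CN%90CC%91C1",
                                 ["%90", "%91"],
                                 ["OC=CC=C%99", "C%99"]))
for smiles in library:
    print(smiles)
```

With two fragments and two attachment points this prints four SMILES
strings. In general the library size is the number of fragments raised
to the number of attachment points, which is why virtual libraries get
large so quickly.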
Make sure the bond types match across the ring closure. C1.C1 and
C1.C-1 are the same as CC and C=1.C1 is the same as C=C, but
C=1.C-1 is illegal because the two explicit bond types conflict.
You'll need to be even more careful with chiral bonds to make sure the
order of the core and fragments is correct.
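A quick text-level check can catch conflicting bond types before the
string reaches a real SMILES parser. This is only a sketch under
strong assumptions: it looks only at two-digit '%' closures, assumes
each label is used for one bond, and ignores the directional '/' and
'\' bonds and chirality entirely. The name closure_bond_conflicts is
hypothetical.

```python
import re

# An optional explicit bond symbol followed by a '%nn' ring closure.
CLOSURE = re.compile(r"([-=#$:]?)%(\d\d)")

def closure_bond_conflicts(smiles):
    """Return the labels whose ends name different explicit bonds."""
    seen = {}  # label -> bond symbols seen ('' means unspecified)
    for bond, label in CLOSURE.findall(smiles):
        seen.setdefault(label, []).append(bond)
    conflicts = set()
    for label, bonds in seen.items():
        explicit = {b for b in bonds if b}
        if len(explicit) > 1:
            conflicts.add(label)
    return conflicts

print(closure_bond_conflicts("C=%90.C-%90"))  # {'90'}
print(closure_bond_conflicts("C=%90.C%90"))   # set()
```

A toolkit's own parser is still the final judge; this only flags the
obvious mismatches early.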
It's very cool that a text editor and a couple of shell commands are all
that's needed to make a virtual library using SMILES.