This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Fun with SMILES I: Does an element exist?
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
I first learned about SMILES
in 1998, when I went to work for Bioreason and used the Daylight
toolkit as part of a system to apply machine learning methods to high-throughput
screening. Yet even after 18 years, I continue to find new ways to
work with SMILES. I recently came up with methods which I haven't
elsewhere, so figured I would share those discoveries. Then I started
adding more examples. This is part 1.
Check for an element using a substring search
I like to work with SMILES at the syntax level. This is a light-weight
approach based on string processing. A heavy-weight approach is to use
a cheminformatics toolkit
which can interpret the full SMILES chemistry model.
For example, if you have a set of valid SMILES strings and you want to
select only those which contain at least one bromine atom, then the
light-weight approach is to find the strings which contain the
substring "Br" while the heavy-weight approach is to parse the
string into a molecule object, iterate over the atoms, and select the
molecule if one of the atoms is a bromine.
def lightweight_has_bromine_atom(smiles):
return "Br" in smiles
from rdkit import Chem
def heavyweight_has_bromine_atom(smiles):
mol = Chem.MolFromSmiles(smiles)
if mol is None:
# SMILES could not be parsed
return False
return any(atom for atom in mol.GetAtoms() if atom.GetAtomicNum() == 35)
The light-weight approach works because a valid SMILES string contains
the two characters "Br" if and only if the structure has a
bromine.
Test the 'Br' search code
It's easy to make a subtle mistake with light-weight methods, so it's
extra important to include good tests, and preferably also use a
cross-comparison with a heavy-weight equivalent to verify that the
tests are correct. Here is a simple test case for the above functions:
# Pass in the function to test
def _test_has_bromine_atom(has_bromine):
for smiles in (
"Br",
"[Br]",
"[80Br]",
"CCBr",
"BCCBr",
"B[Br]",
):
assert has_bromine(smiles), smiles
for smiles in (
"B",
"c1ccccc1B",
"B[Cr]",
):
assert not has_bromine(smiles), smiles
# Test the light-weight and heavy-weight functions
def test_has_bromine_atom():
_test_has_bromine_atom(lightweight_has_bromine_atom)
_test_has_bromine_atom(heavyweight_has_bromine_atom)
(Once you've cross-verified that the test cases are correct, you can
often remove away the heavy-weight versions that were part of the
bootstrap process.)
Why use a light-weight method?
You might use a light-weight approach like this because string
processing is much faster than fully processing a molecule. It's fast
enough that it can even be used as part
of substructure screening before doing the full atom-by-atom
substructure test.
For an example that came up recently in one of my projects, the user
can skip any structures with more than N non-hydrogen atoms,
typically around 50, and structures with fewer then R rings,
typically 1. If a couple of simple requirements about the SMILES
format are met, then it's possible to do both tests before even
touching the toolkit, which can save a lot of time if many structures
can be skipped.
In future essays I'll describe how to do those two tests.
The heavy-weight approach has its own advantages. It can detect if an
input like "BrBr" is valid or invalid, it can handle a wider
variety of notations, and it requires less cleverness to use
correctly. Indeed, this entire set of essays is all about "clever"
ways to work on notation, and part of a tradition that goes back to at
least the 1950s when people would do an often appoximate substructure
search directly on the WLN syntax using punched-card machines.
Check for an element using a regular expression
You can test for many other elements using a simple substring search,
like chlorine/"Cl", vanadium/"V" and
mercury/"Hg" because Br, Cl, and Hg and many other element
symbols exist if and only if the given element is part of the
structure.
Some elements can't be matched by a simple substring test, either
because the element symbol is a substring of a larger element symbol
or because there are other ways for the characters of the element
symbol to appear in a SMILES string. These can sometimes be handled
with a regular expression.
Fluorine
You cannot identify fluorine-containing structure by a substring test
for "F", unless you knew there were no iron/"Fe", francium/"Fr",
fermium/"Fm", or flerovium/"Fl" atoms in the SMILES data
set. Realistically, the last three have maximum known half-lives of 22
minutes, 100.5 days, and 5.5 seconds, so you're very likely to know
that your compound collection doesn't contain those structures, but
iron is pretty common.
Instead of using a substring test, a light-weight search might look
for an "F" which either at the end of the string (as in
"CF") or isn't immediately followed by one of letters in
"[erml]". I'll implement this using the regular expression:
F($|[^erml]) # regular expression to test if a SMILES contains a fluorine atom
Here's the full implementation and a small test suite:
import re
def has_fluorine_atom(smiles):
return _f_atom_pat.search(smiles) is not None
def test_has_fluorine_atom():
for smiles in (
"F",
"[F]",
"[19F]",
"[19F-]",
"F[Fe]",
"[Fe]CF",
"Fc1ccccc1",
"c1ccccc1F",
"c1cc(F)ccc1",
):
assert has_fluorine_atom(smiles), smiles
for smiles in (
"[Fe]",
"[Fr]",
"[Fm]",
"[Fl]",
):
assert not has_fluorine_atom(smiles), smiles
You might think you could check for an "F" so long as it isn't
followed by a lowercase letter, such as with "F[^a-z]". That
would match "F", but not "Fe" or "Fr". This does not work because it
would also ignore the "F" in "Fc1ccccc1".
The hard part of using a light-weight approch is figuring out corner
cases like this. In addition to the hand-made tests like the ones you
just saw, I typically cross-compare against a heavy-weight toolkit
using a large, diverse SMILES collection, to help reduce the chances
that I overlooked something.
Scandium
It's almost true that any test for a two letter symbol can be written
as a simple substring test. One of the most notable counter-examples
is scandium.
A search for "Sc" will indeed identify all SMILES with a
scandium atom. It will also match structures like "Sc1ccncc1"
where a sulfer is attached to an aromatic carbon. Most data sets
contain many more organosulfurs than scandium-containing molecules, so
a subsearch for "Sc" mostly contain false positives.
Instead, a scandium search will need to test for "Sc" inside
of square brackets. According to OpenSMILES
the content of a "bracket_atom"
is:
bracket_atom ::= '[' isotope? symbol chiral? hcount? charge? class? ']'
(I use the OpenSMILES definition instead of the Daylight SMILES
definition because the latter is less strict about the order of the
modifiers after the atom symbol. Daylight SMILES allows both
"[CH+]" and "[C+H]" while OpenSMILES only allows the
first. There is near universal agreement among SMILES generation tools
to produce only the first form, and the restriction is both less
ambiguous and less error prone (what does "[C+H-]" mean?) and
easier to reason with at the syntax level.)
I can match a scandium by looking for the character "["
followed by an optional isotope number, followed by
"Sc". There's no need to match the rest of the bracket atom
term, so I'll use the following regular expression:
\[[0-9]*Sc # regular expression to test if a SMILES contains a scandium atom
which leads to the following implementation and test:
_scandium_atom_pat = re.compile(r"\[[0-9]*Sc")
def has_scandium_atom(smiles):
return _scandium_atom_pat.search(smiles) is not None
def test_has_scandium_atom():
for smiles in (
"[Sc]",
"C[Sc]",
"[Sc]C",
"[45Sc]",
"[45ScH+]",
"[46Sc-3]",
"Cl[Scl](Cl)Cl",
):
assert has_scandium_atom(smiles), smiles
for smiles in (
"Sc1ccccc1",
"[S]c1ccccc1",
"[S]C",
"[Sb]C",
):
assert not has_scandium_atom(smiles), smiles
I used scandium as the example because it's likely to give a large
number of false positives. I could have given similar examples with
aromatic boron, because lead/"Pb" looks the same as a
phosphorous/"P" connected to an aromatic boron/"b",
and niobium/"Nb" looks the same as a nitrogen/"N"
connected to an aromatic boron.
If you don't want any false positives for Pb or Nb, you'll have to use
a regular expression like "\[[0-9]*Pb" or
"\[[0-9]*Nb". On the other hand, aromatic boron is rare,
especially compared to lead, a substring search for "Pb" or
"Nb" will have few false positives. That said, if you justify
the approximate substring search for performance reasons over the
exact regular expression tesst , please make sure to measure the
difference to see if it's meaningful.
Cobalt and osmium
Cobalt/"Co" and osmium/"Os" are interesting middle
case which highlight the difference between valid SMILES syntax and
valid chemistry. In principle the substring "Co" can also
match an aliphatic carbon attached to an aromatic oxygen, and
"Os" match an aliphatic oxygen attached to an aromatic
sulfer.
However, that's chemically unreasonable. An aromatic atom must be part
of an aromatic ring. If the aliphatic notation indicates the first atom
isn't part of a ring, then the second atom must be connected to two
other (aromatic) bonds. For oxygen that can only happen if the atom is
charged, but we know from the syntax that's not the case. For sulfer,
well, I don't know enough chemistry to reject it outright, but I
searched PubChem and found no matches for the SMARTS
"So(:a):a".
This means it's safe to search a chemical database for "Co"
and likely also for "Os" and not get any false positives.
Aromatic atoms don't have to indicate aromatic ring membership
With one odd exception.
The lower-case letter doesn't really mean the atom must be in an
aromatic ring. Most toolkits interpret it as a hint that the atom is
sp2 hybridized as part of the input aromaticity processing. Here's an
example from RDKit with an aromatic oxygen in the input, but with no
other aromatic atoms, and where the toolkit doesn't actually perceive
it as aromatic:
>>> from rdkit import Chem
>>> Chem.CanonSmiles("C1CCCCCo1")
'C1CCCOCC1'
If you have an aromatic oxygen which isn't part of an aromatic ring
then a substring test for "Co" has a higher chance of giving
a false positive. However, you are not going to come across cases like
this in real data sets and I'm not going to worry about non-portable
SMILES like that.
Carbons
This section uses an advanced regular expression syntax. I don't
suggest you use this in real code, at least not without a lot more
testing. I present it here more to show off what can be done with a
light-weight approach.
It's difficult but not impossible to test if a carbon atom is
present. One problem is that the aliphatic carbon "C" is the
first letter of 11 other atomic element symbols including
chlorine. The other problem is the aromatic carbon "c" which
is the second letter of actinium and scandium. That right away tells
you that substring tests aren't possible. The third problem is that
"C", "c", and "Cl" are all part of the organic subset, that is, the
symbols which don't need to be in square brackets.
It's not that hard to detect a "C" or "c" inside of square brackets
using a regular expression like you've seen earlier. Here I'll used
the "verbose"
regular expression syntax, which lets me write the expression over
multiple lines and annotate each component:
\[ # inside of brackets
[0-9]* # skip any isotope
[Cc] # either 'C' or 'c'
[^a-z] # and not part of a two-letter element
The problem is in how to detect if a C or c is outside of square
brackets. For that, I use a lookahead
assertion to test if the match is not followed by some
other pattern. For example, here's a pattern to check if a "C" is not
followed by an "l":
C # C
(?!l) # not followed by an 'l' (which would be chlorine) ...
I'll also require that it not be followed by any non-"["
characters and then a "]" character, because that would mean
it's a bracket atom.
C # C
(?!l) # not followed by an 'l' (which would be chlorine) ...
(?![^][]*\]) # .. and not inside of brackets
This isn't the best of regular expressions because it will search to
the end of the string to look for any square bracket characters, which
might not be present. It's possible to construct better regular
expressions, perhaps by expressing the optional chirality, optional
hydrogen count, etc.. This would be more complicated. A less
complicated solution is to note that "[()=#]" are not allowed
inside of bracket atoms but common outside of them. The negative
lookahead assertion can stop if it finds one of those
characters. I'll leave you to work that out.
Here is is all put together, with some basic tests:
# WARNING: I haven't fully tested this. Let me know of any problems.
_carbon_atom_pat = re.compile(r"""
(
\[ # inside of brackets
[0-9]* # skip any isotope
[Cc] # either 'C' or 'c'
[^a-z] # and not part of a two-letter element
)| (
C # C (in the organic subset)
(?!l) # not followed by an 'l' (which would be chlorine) ...
(?![^][]*\]) # .. and not inside of brackets
) | (
c # c (in the organic subset)
(?![^][]*\]) # .. and not inside of brackets
)
""", re.X) # re.X and re.VERBOSE enable verbose mode syntax
def has_carbon_atom(smiles):
return _carbon_atom_pat.search(smiles) is not None
def test_has_carbon_atom():
for smiles in (
"C",
"[C]",
"[12C]",
"[CH4]",
"NC",
"ClNC",
"c1ccccc1",
"c1ccccc1[Sc]",
"n1nnnc1",
"C[Cl]",
"C[S]",
"[cH]1[cH][cH][cH][cH][cH]1",
"n1nnn[cH]1",
"N[13C@](P)(O)Br",
):
assert has_carbon_atom(smiles), smiles
for smiles in (
"Cl",
"[Cl]",
"[35Cl]",
"O=O",
"[Al]",
"[Cl+]",
"[Sc]",
):
assert not has_carbon_atom(smiles), smiles
It isn't pretty, but I think it does work, and give you a fast way to
find the inorganic structures in your data set.
Digressing a bit, it might be possible use a negative lookbehind
assertion. Python's lookbehind assertions, like most regular
expression implementations, are limited to fixed-length
patterns. That's okay since real-world atoms are limited to three
digits for the isotope, so there are only four possibilities to
enumerate. Even with that limitation, I wasn't able to make a valid
regular expression which passed the tests.
Hydrogens
It's impossble to count the number of hydrogens without having most of
a SMILES parser. How is simple string analysis supposed to know that
"O=O" has no hydrogens while "C=C" has four? Then add support for
charges, atoms with multiple allowed valences, and aromaticity.
The only exceptions are if either the hydrogens are written
explicitly, like "[C]([H])([H])([H])[H]", or written with an
explicitly specified implicit hydrogen count, like
"[CH4]". Both forms are rare.
If they are are all explicit, with no isotope, charged, or other atom
properties, then a substring test for "[H]" will work. Otherwise,
use the scandium-like regular expression:
\[[0-9]*H[^a-z]
As for expicitly specified implicit hydrogens, stay tuned for my next
essay, where I will go into atom-based properties.
Counts and multiple matches
I started this essay by pointing out that a substring test for
"Br" is equivalent to testing if the structure contains a
bromine. This can easily be extended to count the number of bromine
atoms by counting the number of time "Br" occurs in the
SMILES string.
In fact, every substring test can be converted to a count function, so
"smiles.count("V")" gives the number of vanadium atoms.
Count based on regular expressions
The regular expression tests can also be turned into a count function,
by using the "findall()" method instead of the
"search()" method. For example, the following counts the number
of scandium atoms in a SMILES string using the
_scandium_atom_pat from earlier:
If you do this, make sure that your regular expression pattern doesn't
accidentally include additional characters. For example, you can test
for boron by looking for a "B" followed by something that
isn't in "[aehikr]", perhaps with the regular expression:
B[^aehikr] # finds boron, but will also match BB
This would find that the SMILES BB,
contains a boron, by matching both the first and second B, since B is
not one of those lowercase letters. If you use this regular expression
in a findall, then the search would continue after the second
B, so you'll end up with one match instead of finding two boron atoms.
To fix this, switch from "[^aehikr]" to the negative
lookahead assertion (?![aehikr])".
B(?![aehikr]) # only matches boron element symbols
NOTE! This will also match some chiral specifications, like
"@TB1". However, I have never seen those outside of the SMILES
specification. If your data set does contain them, then you will need
to use more advanced techniques, like the tokenizer approach I will
present in a future essay.
Match multiple elements based on regular expressions
It's also possible to combine multiple regular expressions into a
large regular expression, to detect if all of them exist in a single
call to match(), rather than findall() to get match
counts. Here is a regular expression which matches at least two
scandium atoms:
\[[0-9]*Sc.*\[[0-9]*Sc
(You might also test the performance with the non-greedy
".*?" instead of the greedy ".*"; my tests with
"grep -P" showed they were the same.)
However, down this route lies two different forms of combinatorial
madness. If there are n scandium atoms, and you want n+1
then the backtracker will still try all 2n possible
partial matches. This will slow things down. You may want to consider
using a substring count first, then do the regular expression test
filter out false positives.
The other combinatorial explosion comes if you want to find a SMILES
string which contains multiple different element types. As a example,
if you want the (admittedly unlikely) structures with 1 Pb, 1 Sc, and
1 Hg atom, you'll have to write out all six possible orderings of
those terms in the combined regular expression.
Really, it's much better to tokenize and process each atom than it is
to force everything into a regular expression test. I'll describe how
to do that in a future essay (and link it from here once I've written
it).