This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Molecular fragments, R-groups, and functional groups
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
For a change of pace, I figured I would do a basic chemistry lesson
about molecular structures, instead of a more computer oriented blog
post.
Chemists often think about a molecule as a core structure (usually a
ring system) and a set of R-groups. Each
R-group is attached to an atom in the core structure by a
bond. Typically that bond is a single bond, and often "rotatable".
Here's an example of what I mean. The first image below shows the
structure of vanillin, which is
the primary taste behind vanilla. In the second image, I've
circled the three R-groups in the structure.
Vanillin structure (the primary taste of vanilla)
Vanillin with three R-groups identified
The R-groups in this case are R1=a carbonyl group (*-CH=O), R2=a
methoxy group (*-O-CH3), and R3=a hydroxyl group (*-OH), where the "*"
indicates where the R-group attaches to the core structure.
The R-group concept is flexible. Really it just means that you have a
fixed group of connected atoms, which are connected along some bond to
a variable group of atoms, and where the variable group is denoted
R. Instead of looking at the core structure and a set of R-groups, I
can invert the thinking and think of an R-group, like the carbonyl
group, as "the core structure", and the rest of the vanillin as
its R-group.
With that in mind, I'll replace the "*" with the "R" to get the groups
"R-CH=O", "R-O-CH3", and "R-OH". (The "*" means that the fragment is
connected to an atom at this point, but it's really just an
alternative naming scheme for "R".)
All three of these groups are also functional
groups. Quoting Wikipedia, "functional groups are specific groups
(moieties) of atoms or bonds within molecules that are responsible for
the characteristic chemical reactions of those molecules. The same
functional group will undergo the same or similar chemical reaction(s)
regardless of the size of the molecule it is a part of."
The three corresponding functional groups are
R1 = aldehyde,
R2 = ether,
and R3 = hydroxyl.
As the Wikipedia quote pointed out, if you have a reaction which acts on
an aldehyde, you can likely use it on the aldehyde group of vanillin.
Vanillyl group and capsaicin
A functional group can also contain functional groups. I pointed to
the three functional groups attached to the central ring of a
vanillin, but most of the vanillin structure is itself another
functional group, a vanillyl:
Structures which contain a vanillyl group are called
vanilloids. Vanillin
is of course a vanilloid, but surprisingly so is capsaicin, the source
of the "heat" to many a spicy food. Here's the capsaicin structure,
with the vanillyl group circled:
The feeling of heat comes because the capsaicin binds to
TRPV1 (the transient
receptor potential cation channel subfamily V member 1), also known as
the "capsaicin receptor". It's a nonselective receptor, which means
that many things can cause it to activate. Quoting that Wikipedia
page: "The best-known activators of TRPV1 are: temperature greater
than 43 °C (109 °F); acidic conditions; capsaicin, the
irritating compound in hot chili peppers; and allyl isothiocyanate,
the pungent compound in mustard and wasabi." The same receptor detects
temperature, capsaicin, and a compound in hot mustard and wasabi,
which is why your body interprets them all as "hot."
Capsaicin is a member of the capsaicinoid family. All capsaicinoids
are vanilloids, and all vanilloids are phenols. This sort of is-a family
membership relationship in chemistry has led to many taxonomies and
ontologies, including ChEBI.
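An is-a relationship can be sketched as a lookup that follows parent links. The tiny hierarchy below is hand-written from the relationships described in this post. It is not taken from ChEBI, and the names IS_A, ancestors and is_member are made up for illustration.

```python
# Toy is-a hierarchy, written by hand from the post (not ChEBI data).
IS_A = {
    "capsaicin": ["capsaicinoid"],
    "capsaicinoid": ["vanilloid"],
    "vanillin": ["vanilloid"],
}


def ancestors(term, hierarchy=IS_A):
    """Return every class reachable from `term` by following is-a links."""
    found = []
    stack = list(hierarchy.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.append(parent)
            stack.extend(hierarchy.get(parent, []))
    return found


def is_member(term, cls, hierarchy=IS_A):
    """True if `term` is a (possibly indirect) member of `cls`."""
    return cls in ancestors(term, hierarchy)


print(ancestors("capsaicin"))
print(is_member("vanillin", "capsaicinoid"))
```

Real ontologies carry many more relationship types than this single parent link, so treat this only as the general shape of the idea.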
But don't let my example or the existence of nomenclature lead you to
the wrong conclusion that all R-groups are functional groups! An
R-group, at least with the people I usually work with, is a more
generic term used to describe a way of thinking about molecular
structures.
QSAR modeling
QSAR
(pronounced "QUE-SAR") is short for "quantitative structure-activity
relationship", which is a mouthful. (I once travelled to the UK for a
UK-QSAR meeting. The border inspector asked me where I was going, and
I said "the UK-QSAR meeting; QSAR is .." and I blanked on the
expansion of that term! I was allowed across the border, so it
couldn't have been that big of a mistake.)
QSAR deals with the development of models which relate chemical
structure to its activity in a biological or chemical system. Looking
at that, I realize I just moved the words around a bit, so I'll give
a simple example.
Consider an activity, which I'll call "molecular weight". (This is
more of a physical property than a chemical one, but I am trying to
make it simple.) My model for molecular weight assumes that each atom
has its own weight, and the total molecular weight is the sum of the
individual atom weights. I can create a training set of molecules, and
for each molecule determine its structure and molecular weight. With a
bit of least-squares fitting, I can determine the individual atom
weight contribution. Once I have that model, I can use it to predict
the molecular weight of any molecule which contains atoms which the
model knows about.
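Here is a minimal sketch of that molecular weight model, assuming a training set of five small molecules made of only C, H and O. The molecular weights are the usual values from standard atomic weights. The function names (solve, fit_atom_weights, predict) and the choice of solving the normal equations by hand are my assumptions to keep the example dependency-free; a real project would more likely use a numerical library.

```python
ELEMENTS = ["C", "H", "O"]

# Training set: atom counts and molecular weight in g/mol.
TRAINING = [
    ({"C": 1, "H": 4}, 16.043),          # methane
    ({"H": 2, "O": 1}, 18.015),          # water
    ({"C": 2, "H": 6, "O": 1}, 46.069),  # ethanol
    ({"C": 1, "O": 2}, 44.009),          # carbon dioxide
    ({"C": 1, "H": 4, "O": 1}, 32.042),  # methanol
]


def solve(matrix, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_atom_weights(training, elements):
    """Least-squares fit of per-element weights via the normal equations
    (A^T A) w = A^T b, where each row of A holds one molecule's atom counts."""
    rows = [[counts.get(e, 0) for e in elements] for counts, _ in training]
    targets = [mw for _, mw in training]
    n = len(elements)
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(n)]
    return dict(zip(elements, solve(ata, atb)))


def predict(weights, counts):
    """Predicted molecular weight: sum of the fitted per-atom contributions."""
    return sum(weights[e] * n for e, n in counts.items())


weights = fit_atom_weights(TRAINING, ELEMENTS)
vanillin = {"C": 8, "H": 8, "O": 3}  # C8H8O3, not in the training set
print({e: round(w, 3) for e, w in weights.items()})
print(round(predict(weights, vanillin), 3))
```

The fitted contributions should come out close to the standard atomic weights of carbon, hydrogen and oxygen, and the prediction for vanillin close to its tabulated molecular weight of about 152.15.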
Obviously this model will be pretty accurate. It won't be perfect,
because isotopic ratios can vary. (A chemical synthesized from fossil
oil is slightly lighter and less radioactive than the same chemical
derived from environmental sources, because the heavier
radioactive carbon-14 in fossil oil has decayed.) But for most
uses it will be good enough.
A more chemically oriented property is the partition coefficient,
measured in log units as "log P", which is a measure of the solubility
in water compared to a type of oil. This gives a rough idea of whether the
molecule will tend to end up in hydrophobic regions like a cell
membrane, or in aqueous regions like blood. One way to predict log P
is with the atom-based approach I sketched for the molecular weight,
where each atom type has a contribution to the overall measured log
P. (This is sometimes called AlogP.)
In practice, atom-based solutions are not as accurate as
fragment-based solutions. The molecular weight can be atom-centered
because nearly all of the mass is in the atom's nucleus, which is
well localized to the atom. But chemistry isn't really about atoms but
about the electron density around atoms, and electrons are much less
localized than nucleons. The density around an atom depends on the
neighboring atoms and the configuration of the atoms in space.
As a way to improve on that, some methods look at the extended local
environment (this is sometimes called XlogP) or at larger fragment
contributions (for example, BioByte's ClogP). The more complex it is,
the more compounds you need for the training and the slower the
model. But hopefully the result is more accurate, so long as you don't
overfit the model.
Every major data mining method, and most of the minor ones, has
been applied to QSAR models. The history is also quite long. There
are cheminformatics papers from as far back as the 1970s looking at supervised
and unsupervised learning, building on even earlier work on clustering
applied to biological systems.
A problem with most of these methods is their black-box nature. The data is
noisy, and the quantum nature of chemistry isn't that good of a match
to data mining tools, so these predictions are used more often to guide
a pharmaceutical chemist than to make solid predictions. This means
the conclusions should be interpretable by the chemist. Try getting
your neural net to give a chemically reasonable explanation of why it
predicted as it did!
Matched molecular pair (MMP) analysis
is a more chemist-oriented QSAR method, with relatively little
mathematics beyond simple statistics. Chemists have long looked at
activities in simple series, like replacing a methyl (*-CH3) with an
ethyl (*-CH2-CH3) or propyl (*-CH2-CH2-CH3), or replacing a fluorine
with a heavier halogen like a chlorine or bromine. These can form
consistent trends across a wide range of structures, and chemists have
used these observations to develop techniques for how to, say, improve
the solubility of a drug candidate.
MMP systematizes this analysis over all considered fragments,
including not just R-groups (which are connected to the rest of the
structure by one bond) but also so-called "core" structures with two
or three R-groups attached to it. For example, if the known structures
can be described as "A-B-C", "A-D-C", "E-B-F" and "E-D-F" with
activities of 1.2, 1.5, 2.3, and 2.6 respectively then we can do the
following analysis:
A-B-C transforms to A-D-C with an activity shift of 0.3.
E-B-F transforms to E-D-F with an activity shift of 0.3.
Both transforms can be described as R1-B-R2 to R1-D-R2.
Perhaps R1-B-R2 to R1-D-R2 in general causes a shift of 0.3?
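The analysis above can be sketched in a few lines, assuming each structure has already been split into an (R1, core, R2) triple. That pre-fragmented representation, and the names activities and find_transforms, are simplifications of mine and not how a real MMP tool stores molecules.

```python
from collections import defaultdict
from itertools import combinations

# (R1, core, R2) -> measured activity, using the four structures from the text.
activities = {
    ("A", "B", "C"): 1.2,
    ("A", "D", "C"): 1.5,
    ("E", "B", "F"): 2.3,
    ("E", "D", "F"): 2.6,
}


def find_transforms(activities):
    """Group activity shifts by core replacement, for every pair of
    structures that share both R-groups but differ in the core."""
    shifts = defaultdict(list)
    for (mol1, act1), (mol2, act2) in combinations(sorted(activities.items()), 2):
        r1a, core_a, r2a = mol1
        r1b, core_b, r2b = mol2
        if (r1a, r2a) == (r1b, r2b) and core_a != core_b:
            shifts[(core_a, core_b)].append(act2 - act1)
    return shifts


for (old, new), deltas in find_transforms(activities).items():
    mean = sum(deltas) / len(deltas)
    print(f"R1-{old}-R2 to R1-{new}-R2: n={len(deltas)}, mean shift={mean:.2f}")
```

With these four structures the only transform found is B to D, seen twice, with a mean shift of 0.30. Two observations are far too few to trust; a real analysis would want many pairs and some measure of spread before believing a trend.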
It's not quite as easy as this, because the molecular fragments aren't
so easily identified. A molecule might be described as "A-B-C", as
well as "E-Q-F" and "E-H" and "C-T(-P)-A", where "T" has three
R-groups connected to it.
Thanks
Thanks to EPAM Life
Sciences for their Ketcher
tool, which I used for the structure depictions that weren't public domain on Wikipedia.