This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Software mentioned in ChEMBL SD records
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.
When I work with real-world data sets, I usually start with PubChem
before going on to ChEMBL. Why? The PubChem data is all generated by
one chemistry toolkit - I'm pretty sure it's OEChem - while ChEMBL
data comes from many sources.
To get a sense of the diversity, I processed the ChEBML 21 release to
get the second line of each record. The ctfile format specification
says that if non-blank then the 8 characters starting in the third
position should contain the program name. The documentation includes
a description of the fields:
IIPPPPPPPPMMDDYYHHmmddSS…
where those fields are:
II: two characters for the user's initials
PPPPPPPP: eight characters for the program name
MMDDYY: two characters for the month, two for the day, two for the year
HHmm: two characters for the hour, two for the minutes
I'll write a program to extract those program names and count how many
times each one occurs. I don't need a general-purpose SD file reader
because ChEMBL uses a subset of the SD format. For example, there are
only two lines in each record which start with "CHEMBL", the title
line (the first line of the record) and the data line after the
"chembl_id" tag.
My code can read line through the file. The first time it sees a
"CHEMBL" line is the title line, so the following line (the second
line) contains the data. Then when it sees "> <chembl_id>" it
knows to skip the following line, which will have a CHEMBL on it.
There are two oddities here. First, the gzip reader returns byte
strings. I decided to do the pattern matching on the byte strings to
ignore the overhead
of converting everything to Unicode when I only need a few
characters from the file. Second, a Python file object is its own
iterator, so I can use "infile" in both the for-loop and in the body
itself.
import sys, gzip
from collections import Counter
counter = Counter()
with gzip.open("/Users/dalke/databases/chembl_21.sdf.gz", "rb") as infile:
for line in infile:
# Get the line after the title line (which starts with 'CHEMBL')
if line[:6] == b"CHEMBL":
next_line = next(infile)
# Print unique program names
program = next_line[2:10]
if program not in counter:
print("New:", repr(str(program, "ascii")))
counter[program] += 1
if line[:13] == b"> ":
ignore = next(infile) # skip the CHEMBL
print("Done. Here are the counts for each program seen.")
for name, count in counter.most_common():
print(repr(str(name, "ascii")), count)
Program counts
Here is the table of counts:
Program name
count
'SciTegic'
1105617
' '
326445
'CDK 9'
 69145
'Mrv0541 '
30962
''
24531
'CDK 6'
10805
'CDK 1'
8642
'CDK 5'
4771
'CDK 8'
2209
'-ISIS- '
281
'Symyx '
171
'CDK 7'
144
'-OEChem-'
96
'Marvin '
61
'Mrv0540 '
13
' RDKit'
3
'CDK 3'
1
The number of different CDK lines is not because of a version number
but because CDK doesn't format the line correctly. The specification
states that the first few fields of a non-blank line are supposed to be:
were the date starts one byte too early. The date also isn't in the
specified format.
Further examination shows these were only generated in 2009 and 2010. The current
CDK implementation is correct, and a 'git annotate' suggests it was
fixed in 2010.
I don't think anyone uses that line for anything, and don't see the
point of changing anything, so I don't think it's worthwhile to even
ask ChEBML to change these old records.