This post originated from an RSS feed registered with Python Buzz
by Andrew Dalke.
Original Post: Reading ASCII file in Python3.5 is 2-3x faster as bytes than string
I'm porting chemfp from Python2 to
Python3. I read a lot of ASCII files. I'm trying to figure out if it's
better to read them as binary bytes or as text strings.
No matter how I tweak Python3's open() parameters, I can't get the
string read performance to within a factor of 2 of the bytes read
performance. As I haven't seen much discussion of this, I figured I
would document it here.
chemfp reads chemistry file formats which are specified as ASCII.
They contain user-specified fields which are 8-bit clean, so sometimes
people use them to encode non-ASCII data. For example, the SD tag
field "price" might include the price in £GBP or €EUR, and
include the currency symbol either as Latin-1 or UTF-8. (I haven't
come across other encodings, but I've also never worked with SD files
used internally in, say, a Japanese pharmaceutical company.)
These are text files, so it makes sense to read them as text, right? The
main problem is, reading in "r" mode is a lot slower than reading in "rb"
mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X
10.10.5 machine to read the first 10MiB from a 3.1GiB file:
% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop
The Unicode string read() is much slower than the byte string read(),
with a performance ratio of 2.75. (I'll give all numbers in ratios.)
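To reproduce this kind of comparison without the ChEMBL file, here is a minimal sketch. It assumes a generated ASCII-only temporary file as a stand-in for chembl_21.sdf, and the sizes and repeat counts are arbitrary, so the ratio it prints will not match the numbers above.

```python
import os
import tempfile
import timeit

# Build a small ASCII-only stand-in for the SD file.
# (The benchmark above used chembl_21.sdf; this file is a placeholder.)
SIZE = 1024 * 1024
line = b"  -1.2345    0.6789    0.0000 C   0  0\n"
with tempfile.NamedTemporaryFile(suffix=".sdf", delete=False) as tmp:
    tmp.write(line * (SIZE // len(line) + 1))
    filename = tmp.name

def read_text():
    # Text mode: the bytes are decoded and newlines are processed.
    with open(filename, "r", encoding="ascii") as f:
        return f.read(SIZE)

def read_bytes():
    # Binary mode: the bytes are returned as-is.
    with open(filename, "rb") as f:
        return f.read(SIZE)

text_data = read_text()
byte_data = read_bytes()

# Take the best of three runs, as "python -m timeit" does.
text_time = min(timeit.repeat(read_text, number=20, repeat=3))
bytes_time = min(timeit.repeat(read_bytes, number=20, repeat=3))
ratio = text_time / bytes_time
print("text/bytes time ratio: %.2f" % ratio)

os.remove(filename)
```

Both readers return the same content for ASCII data. Only the type (str or bytes) and the time differ.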
Python2 had a similar problem. I originally used "U"niversal mode in
chemfp to read the text files in FPS format, but found that if I
switched from "rU" to "rb", and wrote my code to support both '\n' and
'\r\n' conventions, I could double my overall system read performance
- the "U" option gives a 10x slowdown!
% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop
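Reading in binary mode means handling both newline conventions by hand. The following is a minimal sketch of one way to do that, not the actual chemfp code. The helper name iter_lines is hypothetical, and it does not handle the old Mac convention of a lone '\r'.

```python
from io import BytesIO

def iter_lines(infile):
    # Yield each line of a binary file without its LF or CR LF terminator.
    # Iterating over a binary file splits on b"\n" only.
    for line in infile:
        if line.endswith(b"\r\n"):
            yield line[:-2]
        elif line.endswith(b"\n"):
            yield line[:-1]
        else:
            # Final line with no terminator.
            yield line

unix_data = BytesIO(b"first\nsecond\n")
dos_data = BytesIO(b"first\r\nsecond\r\n")
unix_lines = list(iter_lines(unix_data))
dos_lines = list(iter_lines(dos_data))
print(unix_lines == dos_lines)
```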
Nelson Minar ran similar tests and reported these findings:
* Python 2 and Python 3 read bytes at the same speed
* In Python 2, decoding Unicode is 10x slower than reading bytes
* In Python 3, decoding Unicode is 3-7x slower than reading bytes
* In Python 3, universal newline conversion is ~1.5x slower than skipping it, at least if the file has DOS newlines
* In Python 3, codecs.open() is faster than open().
The Python3 open() function takes more parameters than Python2,
including 'newline', which affects how the text mode reader identifies
newlines, and 'encoding':
open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
...
newline controls how universal newlines works (it only applies to text
mode). It can be None, '', '\n', '\r', and '\r\n'. It works as
follows:
* On input, if newline is None, universal newlines mode is
enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
these are translated into '\n' before being returned to the
caller. If it is '', universal newline mode is enabled, but line
endings are returned to the caller untranslated. If it has any of
the other legal values, input lines are only terminated by the given
string, and the line ending is returned to the caller untranslated.
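The three input behaviors described above can be seen on a tiny file. This is a minimal sketch which assumes a temporary file containing one DOS line ending, one lone '\r', and one '\n'. The helper name read_lines is made up for the example.

```python
import os
import tempfile

# Write in binary mode so nothing is translated on the way out.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb\rc\n")
    filename = tmp.name

def read_lines(newline):
    with open(filename, "r", encoding="ascii", newline=newline) as f:
        return f.readlines()

# None: universal newlines, every ending translated to '\n'.
universal = read_lines(None)
# '': universal line splitting, but endings are left untranslated.
untranslated = read_lines("")
# '\n': only '\n' ends a line, so the lone '\r' stays inside a line.
only_lf = read_lines("\n")

os.remove(filename)
print(universal, untranslated, only_lf)
```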
I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
The slowdown ratio of 2.3 for newline="\n" is better than the 2.75
for universal newlines and the newline="" case that Nelson Minar
tested, but still less than half the performance of the byte reader.
I also wondered if the encoding made a difference:
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop
My benchmark shows that ASCII and UTF-8 encodings are equally fast,
and Latin-1 is 14% slower, even though my data set contains only
ASCII. I did not expect any difference. I assume a lot of time has
been spent making the UTF-8 code go fast, but don't know why the
Latin-1 reader is noticeably slower on ASCII data.
Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
I noticed no performance difference between codecs.open() and builtin
open() for this test case.
I'm left with a bit of a quandary. I work with ASCII text data, with
only the occasional non-ASCII field. For example, chemfp has
specialized code to read an id tag and encoded fingerprint field from
an SD file. In rare and non-standard cases, the handful of characters
in the id/title line might be non-ASCII, but the hex-encoded
fingerprint is never anything other than ASCII. It makes sense to use
the text reader. But if I use the text reader, it will decode
everything in each record (typically 2K-8K bytes), when I only need to
decode at most 100 bytes of the record.
In chemfp, I used to have a two-pass solution to find records in an
SD file. The first pass found the fields of interest, and the second
counted newlines for better error reporting. I found that even that
level of data re-scanning caused an observable slowdown, so I
shouldn't be surprised that an extra pass to check for non-ASCII
characters might also be a problem. But a two-fold slowdown?
This performance overhead leads me to conclude that I need to process
my performance critical files as bytes, rather than strings, and delay
the byte-to-string decoding as much as possible.
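As an illustration of delaying the decoding, here is a minimal sketch which keeps the record as bytes and decodes only the title line. It is not the chemfp implementation. The function names, the simplified "> <tag>" matching, and the example record are all made up for the example.

```python
def get_title(record, encoding="utf8", errors="strict"):
    # The title is the first line of an SD record. Find it in the bytes
    # and decode only that slice, not the whole record.
    end = record.find(b"\n")
    if end == -1:
        end = len(record)
    return record[:end].rstrip(b"\r").decode(encoding, errors)

def get_tag_bytes(record, tag):
    # Return the first data line after "> <tag>" as bytes, or None.
    # A hex-encoded fingerprint is always ASCII, so it never needs decoding.
    # Simplification: real SD tag headers can contain more than "> <tag>".
    start = record.find(b"> <" + tag + b">")
    if start == -1:
        return None
    line_end = record.find(b"\n", start)
    if line_end == -1:
        return None
    start = line_end + 1
    end = record.find(b"\n", start)
    if end == -1:
        end = len(record)
    return record[start:end].rstrip(b"\r")

# A made-up, heavily abbreviated record.
record = (b"CHEMBL-example\n  toolname\n\n"
          b"  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
          b"> <FP>\n0123abcd\n\n$$$$\n")

title = get_title(record)
fingerprint = get_tag_bytes(record, b"FP")
print(title, fingerprint)
```

Only the title goes through the decoder, so a non-ASCII byte elsewhere in the record costs nothing and cannot raise an exception.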
RDKit and (non-)Unicode
I checked with the RDKit, which is a cheminformatics toolkit. The core
is in C++, with Python extensions through Boost.Python. It treats the
files as bytes, and lazily exposes the data to Python as Unicode. For
example, if I place a byte string which is not valid UTF-8 in the
title or tag field, then it will read and write the data without
problems, because the data is stored in C++ data structures based on
the byte string. But if I try to get the properties from Python, I get
a UnicodeDecodeError.
Here's an example. First, I'll get a valid record, which is all ASCII:
>>> block = open("chembl_21.sdf", "rb").read(10000)
>>> i = block.find(b"$$$$\n")
>>> i
973
>>> record = block[:i+5]
>>> chr(max(record))
'm'
I'll use the RDKit to parse the record, then show that I can read the
molecule id, which comes from the first line (the title line) of the
file:
If I then create a 'bad_record' by prefixing the byte 0xC2 to the
record, then I can still process the record, but I cannot get
the title:
>>> bad_record = b"\xC2" + record
>>> bad_mol = Chem.MolFromMolBlock(bad_record)
>>> bad_mol.GetNumAtoms()
16
>>> bad_mol.GetProp("_Name")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
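This is because 0xC2 is the lead byte of a two-byte UTF-8 sequence and
must be followed by a continuation byte (0x80-0xBF), which the first
byte of the original record is not. The RDKit decodes the bytes as
UTF-8 with an error handler which fails for invalid sequences.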
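The same decoding behavior can be reproduced in plain Python, without the RDKit. This is a minimal sketch, and the title bytes are made up for illustration.

```python
# 0xC2 followed by an ASCII byte (made-up title).
title = b"\xC2CHEMBL"

try:
    title.decode("utf8")
    error_reason = None
except UnicodeDecodeError as err:
    # The strict handler rejects the sequence.
    error_reason = err.reason

# 0xC2 followed by a continuation byte is valid: this is the pound sign.
pound = b"\xC2\xA3".decode("utf8")

# Latin-1 maps every byte to a character, so it cannot fail.
as_latin1 = title.decode("latin-1")

# The "replace" handler substitutes U+FFFD for the bad byte.
replaced = title.decode("utf8", errors="replace")

print(error_reason)
```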
I can even use C++ to write the string to a file, since the C++ code
treats everything as bytes:
(The symbol could not be represented in my UTF-8 terminal, so it uses a "?".)
On the other hand, I get a UnicodeDecodeError if I use a Python file object:
>>> from io import BytesIO
>>> f = BytesIO()
>>> outfile = Chem.SDWriter(f)
>>> outfile.write(bad_mol)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
(It doesn't matter if I replace the BytesIO with a StringIO. It hasn't
gotten to that point yet. When the SDWriter() is initialized with a
Python file handle then told to write a molecule, it writes the
molecule to a C++ byte string, converts that to a Python string, and
passes that Python string to the file handle. The failure here is in
the C++ to Python translation.)
The simple conclusion from this is the same as the punchline from the
old joke "Doctor, doctor, it hurts when I do this." "Then don't do
that." But it's too simple. SD files come from all sorts of places,
including sources which may use '\xC2' as the Latin-1 encoding of
Â. You don't want your seemingly well-tested system to crash
because of something like this.
I'm not sure what the right solution is for the RDKit, but I can
conclude that I need a test case something like this for any of my SD
readers, and that correct support for the bytes/string translation is
not easy.
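Such a test case might look like the following minimal sketch. It assumes one possible policy, "try UTF-8, then fall back to Latin-1", which is not necessarily what the RDKit or chemfp should do. The function name decode_field and the field values are hypothetical.

```python
def decode_field(raw):
    # Hypothetical policy: prefer UTF-8, fall back to Latin-1, which
    # accepts every byte value and so cannot raise.
    try:
        return raw.decode("utf8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

cases = [
    # Plain ASCII.
    (b"price 10 GBP", "price 10 GBP"),
    # Pound sign encoded as UTF-8.
    ("price \u00a310".encode("utf8"), "price \u00a310"),
    # Pound sign encoded as Latin-1.
    ("price \u00a310".encode("latin-1"), "price \u00a310"),
    # A 0xC2 prefix, like the bad_record above.
    (b"\xC2CHEMBL", "\u00c2CHEMBL"),
]

results = [decode_field(raw) for raw, expected in cases]
for (raw, expected), result in zip(cases, results):
    assert result == expected, (raw, result)
```

The fallback has a cost: Latin-1 bytes that happen to form a valid UTF-8 sequence will be decoded as UTF-8, so the policy is a guess and not a guarantee.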
Want to leave a comment?
If you have a better suggestion than using bytes, like a faster way to
read ASCII text as Unicode strings, or have some insight into why
reading an ASCII file as a string is so relatively slow in Python3, let
me know.
But don't get me wrong. I do scientific programming, which with rare
exceptions is centered around the 7-bit ASCII world of the
1960s. Python3 has made great efforts to make Unicode parsing and
decoding fast, which is important for most real-world data outside of
my niche.