The Artima Developer Community

Python Buzz Forum
Reading ASCII file in Python3.5 is 2-3x faster as bytes than string

Andrew Dalke


Andrew Dalke is a consultant and software developer in computational chemistry and biology.
Reading ASCII file in Python3.5 is 2-3x faster as bytes than string Posted: Aug 3, 2016 10:28 AM

This post originated from an RSS feed registered with Python Buzz by Andrew Dalke.
Original Post: Reading ASCII file in Python3.5 is 2-3x faster as bytes than string
Feed Title: Andrew Dalke's writings
Feed URL: http://www.dalkescientific.com/writings/diary/diary-rss.xml
Feed Description: Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure.


I'm porting chemfp from Python2 to Python3. I read a lot of ASCII files. I'm trying to figure out if it's better to read them as binary bytes or as text strings.

No matter how I tweak Python3's open() parameters, I can't get the string read performance to within a factor of 2 of the bytes read performance. As I haven't seen much discussion of this, I figured I would document it here.

chemfp reads chemistry file formats which are specified as ASCII. They contain user-specified fields which are 8-bit clean, so sometimes people use them to encode non-ASCII data. For example, the SD tag field "price" might include the price in £GBP or €EUR, and include the currency symbol either as Latin-1 or UTF-8. (I haven't come across other encodings, but I've also never worked with SD files used internally in, say, a Japanese pharmaceutical company.)

These are text files, so it makes sense to read them as text, right? The main problem is that reading in "r" mode is a lot slower than reading in "rb" mode. Here's my benchmark, which uses Python 3.5.2 on a Mac OS X 10.10.5 machine to read the first 10MiB of a 3.1GiB file:

% python -V
Python 3.5.2
% python -m timeit 'open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.3 msec per loop
% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.74 msec per loop
The Unicode string read() is much slower than the byte string read(), with a performance ratio of 2.75. (I'll give all numbers in ratios.)

Python2 had a similar problem. I originally used "U"niversal newline mode in chemfp to read the text files in FPS format, but found that if I switched from "rU" to "rb" and wrote my code to support both '\n' and '\r\n' conventions, I could double my overall system read performance - the "U" option alone gives a 10x slowdown on the raw read!

% python2.7 -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.7 msec per loop
% python2.7 -m timeit 'open("chembl_21.sdf", "rU").read(10*1024*1024)'
10 loops, best of 3: 36.7 msec per loop
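Supporting both conventions yourself in binary mode only takes a little care with the trailing bytes. Here's a minimal sketch of the idea (the helper name is mine, not chemfp's):

```python
from io import BytesIO

def iter_lines(binary_file):
    # Iterating a binary file splits on b'\n'; strip an optional
    # preceding b'\r' ourselves instead of asking for "U" mode.
    for line in binary_file:
        if line.endswith(b"\r\n"):
            yield line[:-2]
        elif line.endswith(b"\n"):
            yield line[:-1]
        else:
            yield line  # final line without a trailing newline

lines = list(iter_lines(BytesIO(b"alpha\r\nbeta\ngamma")))
print(lines)  # [b'alpha', b'beta', b'gamma']
```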

This observation is not new. A quick Duck Duck Go search found a 2015 blog post by Nelson Minar which concluded:

  • Python 2 and Python 3 read bytes at the same speed
  • In Python 2, decoding Unicode is 10x slower than reading bytes
  • In Python 3, decoding Unicode is 3-7x slower than reading bytes
  • In Python 3, universal newline conversion is ~1.5x slower than skipping it, at least if the file has DOS newlines
  • In Python 3, codecs.open() is faster than open().

The Python3 open() function takes more parameters than Python2, including 'newline', which affects how the text mode reader identifies newlines, and 'encoding':

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True)
  ...
    newline controls how universal newlines works (it only applies to text
    mode). It can be None, '', '\n', '\r', and '\r\n'.  It works as
    follows:
    
    * On input, if newline is None, universal newlines mode is
      enabled. Lines in the input can end in '\n', '\r', or '\r\n', and
      these are translated into '\n' before being returned to the
      caller. If it is '', universal newline mode is enabled, but line
      endings are returned to the caller untranslated. If it has any of
      the other legal values, input lines are only terminated by the given
      string, and the line ending is returned to the caller untranslated.
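The three behaviors are easy to see with a small throwaway file containing mixed line endings:

```python
import os
import tempfile

# Write a file with both DOS and Unix endings, then read it back
# under each of the three 'newline' settings described above.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"a\r\nb\nc")

with open(path, "r", newline=None) as f:
    universal = f.read()        # endings translated to '\n'
with open(path, "r", newline="") as f:
    untranslated = f.read()     # universal splitting, endings preserved
with open(path, "r", newline="\n") as f:
    lf_only = f.readlines()     # a line ends only at '\n'
os.remove(path)

print(repr(universal))  # 'a\nb\nc'
```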

I'll 'deuniversalize' the text reader and benchmark newline="\n" and newline="":

% python -m timeit 'open("chembl_21.sdf", "rb").read(10*1024*1024)'
100 loops, best of 3: 3.81 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline=None).read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
The newline="\n" slowdown ratio of 2.3 is better than the 2.75 for universal newlines and for the newline="" case that Nelson Minar tested, but it's still less than half the performance of the bytes reader.

I also wondered if the encoding made a difference:

% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="ascii").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="utf8").read(10*1024*1024)'
100 loops, best of 3: 8.8 msec per loop
% python -m timeit 'open("chembl_21.sdf", "r", newline="\n", encoding="latin-1").read(10*1024*1024)'
100 loops, best of 3: 10.1 msec per loop
My benchmark shows that the ASCII and UTF-8 encodings are equally fast, and Latin-1 is 14% slower, even though my data set contains only ASCII. I did not expect any difference. I assume a lot of time has been spent making the UTF-8 codec go fast, but I don't know why the Latin-1 reader is noticeably slower on ASCII data.

Nelson Minar also tested the codecs.open() performance, so I'll repeat it:
% python -m timeit -s 'import codecs' 'codecs.open("chembl_21.sdf", "r").read(10*1024*1024)'
100 loops, best of 3: 10.2 msec per loop
I noticed no performance difference between codecs.open() and the builtin open() for this test case.

I'm left with a bit of a quandary. I work with ASCII text data, with only the occasional non-ASCII field. For example, chemfp has specialized code to read an id tag and encoded fingerprint field from an SD file. In rare and non-standard cases, the handful of characters in the id/title line might be non-ASCII, but the hex-encoded fingerprint is never anything other than ASCII. It makes sense to use the text reader. But if I use the text reader, it will decode everything in each record (typically 2K-8K bytes), when I only need to decode at most 100 bytes of the record.

In chemfp, I used to have a two-pass solution to find records in an SD file. The first pass found the fields of interest, and the second counted newlines for better error reporting. I found that even that level of data re-scanning caused an observable slowdown, so I shouldn't be surprised that an extra pass to check for non-ASCII characters might also be a problem. But, two-fold slowdown?

This performance overhead leads me to conclude that I need to process my performance critical files as bytes, rather than strings, and delay the byte-to-string decoding as much as possible.
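A minimal sketch of what that delayed decoding might look like - parse the record as bytes and decode only the few bytes of the field I need (the helper is mine, not actual chemfp code):

```python
def get_title(record_bytes):
    # The id/title is the first line of an SD record; decode just
    # those few bytes rather than the whole multi-kilobyte record.
    return record_bytes.split(b"\n", 1)[0].decode("utf-8")

record = b"CHEMBL153534\n     RDKit          2D\n...\n$$$$\n"
print(get_title(record))  # CHEMBL153534
```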

RDKit and (non-)Unicode

I checked with the RDKit, which is a cheminformatics toolkit. The core is in C++, with Python extensions through Boost.Python. It treats the files as bytes, and lazily exposes the data to Python as Unicode. For example, if I place a byte string which is not valid UTF-8 in the title or tag field, then it will read and write the data without problems, because the data is stored in C++ data structures based on the byte string. But if I try to get the properties from Python, I get a UnicodeDecodeError.

Here's an example. First, I'll get a valid record, which is all ASCII:

>>> block = open("chembl_21.sdf", "rb").read(10000)
>>> i = block.find(b"$$$$\n")
>>> i
973
>>> record = block[:i+5]
>>> chr(max(record))
'm'
I'll use the RDKit to parse the record, then show that I can read the molecule id, which comes from the first line (the title line) of the record:
>>> from rdkit import Chem
>>> mol = Chem.MolFromMolBlock(record)
>>> mol.GetProp("_Name")
'CHEMBL153534'
If I then create a 'bad_record' by prefixing the byte 0xC2 to the record, then I can still process the record, but I cannot get the title:
>>> bad_record = b"\xC2" + record
>>> bad_mol = Chem.MolFromMolBlock(bad_record)
>>> bad_mol.GetNumAtoms()
16
>>> bad_mol.GetProp("_Name")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
This is because the byte 0xC2 starts a two-byte UTF-8 sequence, but the byte which follows it here is not a valid continuation byte, and the RDKit decodes the bytes as UTF-8 with a strict error handler which fails on invalid input.
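For plain Python decoding, that strictness is a choice: the errors argument to bytes.decode() trades the exception for lossy or reversible output. For example:

```python
bad = b"\xC2" + b"CHEMBL153534"

# The default 'strict' handler raises, as seen above:
try:
    bad.decode("utf-8")
except UnicodeDecodeError:
    pass  # invalid continuation byte in position 0

# 'replace' is lossy but never raises; the bad byte becomes U+FFFD:
lossy = bad.decode("utf-8", errors="replace")

# 'surrogateescape' is reversible: re-encoding recovers the bytes:
escaped = bad.decode("utf-8", errors="surrogateescape")
round_trip = escaped.encode("utf-8", errors="surrogateescape")
print(round_trip == bad)  # True
```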

I can even use C++ to write the string to a file, since the C++ code treats everything as bytes:

>>> outfile = Chem.SDWriter("/dev/tty")
>>> outfile.write(mol); outfile.flush()
CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
   ...
>>> outfile.write(bad_mol); outfile.flush()
?CHEMBL153534
     RDKit          2D

 16 17  0  0  0  0  0  0  0  0999 V2000
 ...
(The symbol could not be represented in my UTF-8 terminal, so it uses a "?".)

On the other hand, I get a UnicodeDecodeError if I use a Python file object:

>>> from io import BytesIO
>>> f = BytesIO()
>>> outfile = Chem.SDWriter(f)
>>> outfile.write(bad_mol)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
(It doesn't matter if I replace the BytesIO with a StringIO. It hasn't gotten to that point yet. When the SDWriter() is initialized with a Python file handle then told to write a molecule, it writes the molecule to a C++ byte string, converts that to a Python string, and passes that Python string to the file handle. The failure here is in the C++ to Python translation.)

The simple conclusion from this is the same as the punchline from the old joke "Doctor, doctor, it hurts when I do this." "Then don't do that." But it's too simple. SD files come from all sorts of places, including sources which may use '\xC2' as the Latin-1 encoding of Â. You don't want your seemingly well-tested system to crash because of something like this.
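One defensive pattern - my sketch, not something the RDKit does - is to try UTF-8 first and fall back to Latin-1, which can never fail because every byte value is valid Latin-1:

```python
def decode_field(raw):
    # Try UTF-8 first; every byte value is valid Latin-1, so the
    # fallback cannot raise (though it may guess the encoding wrong).
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_field(b"\xC2\xA3 12.50"))  # '£ 12.50' - valid UTF-8 for U+00A3
print(decode_field(b"\xC2 12.50"))      # 'Â 12.50' - Latin-1 fallback
```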

I'm not sure what the right solution is for the RDKit, but I can conclude that I need a test case something like this for any of my SD readers, and that correct support for the bytes/string translation is not easy.

Want to leave a comment?

If you have a better suggestion than using bytes, like a faster way to read ASCII text as Unicode strings, or have some insight into why reading an ASCII file as a string is so relatively slow in Python3, let me know.

But don't get me wrong. I do scientific programming, which with rare exceptions is centered around the 7-bit ASCII world of the 1960s. Python3 has made great efforts to make Unicode parsing and decoding fast, which is important for most real-world data outside of my niche.




Copyright © 1996-2019 Artima, Inc. All Rights Reserved. - Privacy Policy - Terms of Use