The Artima Developer Community

Python Buzz Forum
Contains The Seeds Of Its Own Deconstruction

Ben Last

Posts: 247
Nickname: benlast
Registered: May, 2004

Ben Last is no longer using Python.
Contains The Seeds Of Its Own Deconstruction Posted: Jun 15, 2004 1:12 PM

This post originated from an RSS feed registered with Python Buzz by Ben Last.
Original Post: Contains The Seeds Of Its Own Deconstruction
Feed Title: The Law Of Unintended Consequences
Feed URL: http://benlast.livejournal.com/data/rss
Feed Description: The Law Of Unintended Consequences

Unicode is both wonderful, and yet not.

It's wonderful in that there is a worldwide (in effect) standard, widely supported, that allows for reasonably straightforward handling of strings in most any character set that one might need to consider.  The demons of complexity, however, crawl from their hiding places when it comes to dealing with the interface between, say, the Python implementation and the environment in which it might run; specifically, for me, when dealing with Unicode text files on Windows.

Let's set some context: Windows is rather happy with Unicode, especially the more recent incarnations like XP.  Notepad will eat and spit out files in UTF8 and "Unicode" (which is actually UTF16) form, marking them with appropriate BOMs (Byte Order Marks).  These serve a dual function: they identify the exact encoding of a file, and they allow Windows tools to recognise it as Unicode rather than plain text.  I have, as part of The Mobile Phone Project, to deal with such files.
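For the curious, the BOM byte sequences themselves are short and easy to inspect from the codecs module; a quick look at the constants (output shown in the comments) gives:
import codecs

print repr(codecs.BOM_UTF8)      # '\xef\xbb\xbf'
print repr(codecs.BOM_UTF16_LE)  # '\xff\xfe'
print repr(codecs.BOM_UTF16_BE)  # '\xfe\xff'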

The question, then, is how we sensibly handle such files.  The codecs module provides a neat open() that returns a file-compatible wrapper to read them in, provided that you know the encoding of the file in advance.  But that's not always possible; we all know that, given a possible error, a user somewhere, sometime will make it, and hand me a file that is in UTF16, not UTF8.
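If you do know the encoding up front, codecs.open() is all you need; a minimal sketch (the filename is made up):
import codecs

#Fine - but only because we already know the file is UTF8.
f = codecs.open('IKnowItsUTF8.txt', 'rb', encoding='utf_8')
lines = f.readlines()   #each line comes back as a Unicode string
f.close()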

Well, just like Notepad or Word, we look at the BOM.  The codecs module provides a set of BOM constants, all defined as Python strings.  Why strings and not Unicode types?  My guess is that the typical use of these is to match the start of byte strings, to detect the encoding, so it makes sense to have them in the same form as byte strings.  However, this isn't quite enough: a BOM itself will decode to a valid Unicode character, but not a useful one (it's a zero-width no-break space), which is enough to mess up many parsers.  We need the conversion to discard the BOM.
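You can see the problem at the interactive prompt; decoding a BOM-prefixed string without stripping the BOM leaves a stray U+FEFF at the front (a small sketch, output in the comments):
import codecs

data = codecs.BOM_UTF8 + 'hello'
print repr(data.decode('utf_8'))                          # u'\ufeffhello' - the BOM survives as a character
print repr(data[len(codecs.BOM_UTF8):].decode('utf_8'))   # u'hello' - what we actually want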

Thus we can start to write some code:
import codecs

#Not all BOMs have an appropriate equivalent codec.  However,
#these are the BOMs encountered on Windows.
BOMmap = { codecs.BOM_UTF8     : 'utf_8',
           codecs.BOM_UTF16_LE : 'utf_16_le',
           codecs.BOM_UTF16_BE : 'utf_16_be' }

#The longest BOM in the map; the most bytes we ever need to read to identify it.
maxBOMlen = max(map(lambda x: len(x), BOMmap.keys()))

def mapBOMToEncoding(data):
    """Given a string in data, map any BOM found to an
    encoding name.  Return a tuple of encoding, length of BOM.
    Return None,0 if no match occurred."""
    for b in BOMmap.keys():
        if data.startswith(b):
            return (BOMmap[b], len(b))
    return (None, 0)
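A quick sanity check of the mapping, with made-up byte strings (output in the comments):
print mapBOMToEncoding(codecs.BOM_UTF16_LE + 'M\x00o\x00')   # ('utf_16_le', 2)
print mapBOMToEncoding('no BOM here')                        # (None, 0)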


This gives us a way to map a BOM found in a string to an encoding as well as the length of the BOM so that we can discard it.  Which makes for a simple function:
def encodedToUnicode(data, errors='strict'):
    """Given the data in a string, decode any BOM at the start to
    deduce the encoding, and return the Unicode string.  If the data
    does not hold a BOM, treat it as UTF_8.  UnicodeDecodeError exceptions
    may be thrown for invalid data.  The errors parameter is passed to the
    decode() function and may be the usual values ('strict','replace','ignore')"""
    (encoding, offset) = mapBOMToEncoding(data)
    if not encoding:
        #If no BOM match was found, try utf8, since that
        #will eat ASCII properly.
        encoding = 'utf_8'
        offset = 0

    return data[offset:].decode(encoding, errors)
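Before wiring it up to a file, a quick in-memory check (a sketch, reusing the Cyrillic test string that appears later in this post) shows it behaving as expected:
moscow = u'\u041c\u043e\u0441\u043a\u0432\u0430'   #u'Москва'
raw = codecs.BOM_UTF16_LE + moscow.encode('utf_16_le')
assert encodedToUnicode(raw) == moscow

#No BOM at all: falls back to UTF8, which copes happily with plain ASCII.
assert encodedToUnicode('just ascii') == u'just ascii'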


encodedToUnicode() is suitable for use with data read from an ordinary built-in file, as in:
#Note the "rb", necessary on Windows to prevent text mangling.
f = open('IThinkItsUnicode.txt', 'rb')
#read all the data in one go
data = f.read()
f.close()

udata = encodedToUnicode(data)



But, of course, not all files are suitable to be yanked into memory in one go.  What would be useful is a wrapper around a file object that detects the BOM itself, in the style of the wrappers the codecs module already provides.  Here's a function that attempts to detect the BOM and returns a codecs.StreamReader that decodes the data on the fly.
def openBOMFile(filename, errors='strict'):
    """A wrapper for codecs.open() that returns a codecs.StreamReader for the
    file, as determined by BOM.  No BOM means we assume UTF8.  The errors parameter
    is as passed to decode() methods and defaults to 'strict'.
    Opening the file via this method will cause the first few bytes to be read
    immediately.  The file must be rewindable (allow seeking backwards from the
    current tell())."""
    encoding = None
    offset = 0
    file = __builtins__.open(filename, 'rb')

    #Get the first few bytes of the file to check the BOM.
    data = file.read(maxBOMlen)
    if data:
        (encoding, offset) = mapBOMToEncoding(data)
    if not encoding:
        encoding = 'utf_8'

    #Seek to the first character after the BOM.
    #Do the seek even if the offset is zero, otherwise
    #we may lose the first byte of the file.
    file.seek(offset)
    (e,d,srf,swf) = codecs.lookup(encoding)

    #Generate and return an appropriate StreamReader.
    sr = srf(file, errors)
    sr.encoding = encoding
    return sr
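Usage is then much like an ordinary file; a sketch reading the same file as in the earlier example:
f = openBOMFile('IThinkItsUnicode.txt')
print f.encoding         #whichever encoding the BOM implied (or 'utf_8')
lines = f.readlines()    #each line arrives as a Unicode string
f.close()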



The codecs module provides a number of useful classes for reading and writing encoded files.  However, the StreamWriters don't write the BOM, so here's a quick example of writing a UTF8 file that Notepad will handle.
First, here's the function that does the writing; it returns a codecs.StreamWriter object with the BOM already written.
def writeUTF8File(filename, mode='wb', errors='strict'):
    """Open the file for writing, write the UTF8 BOM and
    then wrap the file in a codecs.StreamWriter so that
    it can be used to spit out Unicode."""

    #Ensure that the mode is binary.
    if not 'b' in mode:
        if mode.endswith('+'):
            mode = mode[0] + 'b+'
        else:
            mode = mode + 'b'
    file = __builtins__.open(filename, mode)

    #We only write the BOM if the mode implies truncation and
    #writing; we can't do anything if the file already exists.
    if mode.startswith('w'):
        file.write(codecs.BOM_UTF8)

    (e,d,srf,swf) = codecs.lookup('utf_8')

    #Generate and return an appropriate StreamWriter.
    sw = swf(file, errors)
    sw.encoding = 'utf_8'
    return sw


And here's a call to it.  Be careful if you copy and paste this: the "Москва" in the comment is a Unicode string and, by default, PythonWin won't like it, which is why I used Unicode escapes to build the literal.  Cyrillic is useful for test data, since all the characters are outside the eight-bit range (which is not true of most Western European sets).
f = writeUTF8File('d:\\temp\\test.utf8','wb')
#write 'Moscow' in Cyrillic/Russian (Москва)
f.write(u'\u041c\u043e\u0441\u043a\u0432\u0430\r')
f.close()
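Reading the file back through openBOMFile() from earlier completes the round trip; a small sketch using the same path:
f = openBOMFile('d:\\temp\\test.utf8')
print repr(f.read())   #u'\u041c\u043e\u0441\u043a\u0432\u0430\r'
f.close()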


Unix systems, according to some apparently-biased sources, don't tend to use BOMs - they would break conventions such as the initial #! syntax for shells; thus on Unix systems you need to be sure of the encoding you're dealing with (in theory, the current locale defines the format of all input and output files, but few of us are lucky enough to have all our data and processing within a single locale).  That doesn't, of course, mean you can't use BOMs in your own data files; they may have been a Microsoft suggestion, but they're part of the Unicode standard and help address practical problems.  The back-end systems that support The Mobile Phone Project are all Linux-based and use UTF8 (with BOMs) throughout.
Having said that, according to Unicode.org, KDE from v1.89 and later versions of GTK (from 2) are Unicode-compliant, using UTF16 internally.

Read: Contains The Seeds Of Its Own Deconstruction


