This post originated from an RSS feed registered with Java Buzz
by Marc Logemann.
Original Post: CP850 charset - still in use :(
Feed Title: Marc's Java Blog
Feed URL: http://www.logemann.org/day/index_java.xml
Feed Description: Java related topics for all major areas. So you will see J2ME, J2SE and J2EE issues here.
I am currently developing a program that works with data from Deutsche Post World Net, one of the largest logistics providers worldwide. For this program to work, I have to read in about 500 MB of flat-file data I got on CD from Deutsche Post. I thought this was a perfect use case for NIO (I had not used NIO so far and was excited).
So I wrote a small test program to read the data from the filesystem and wondered why I didn't get the German umlauts like öäü correctly. A quick check with a hex editor showed that the "ü", for example, was stored as hex 81. I was quite sure that in ISO-8859-1 the "ü" is not at 0x81 (it lives at 0xFC). And indeed, it turned out I was dealing with a different charset. After some more investigation I found out that they used CP850, a charset whose heyday was in MS-DOS times. Great.
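The mismatch is easy to reproduce: decode the same byte with both charsets and compare. A minimal sketch, assuming the JRE knows the alias "Cp850" (it may also be registered as "IBM850"):

```java
public class Cp850Check {
    public static void main(String[] args) throws Exception {
        byte[] raw = { (byte) 0x81 };            // the byte found in the hex editor

        // In ISO-8859-1, 0x81 is an unprintable C1 control character...
        String asLatin1 = new String(raw, "ISO-8859-1");
        // ...but in CP850 it is the "ü" we were looking for.
        String asCp850  = new String(raw, "Cp850");

        System.out.println("ISO-8859-1: " + asLatin1);
        System.out.println("CP850:      " + asCp850);
    }
}
```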
I thought I could just switch the encoding in my source code, but then I realized that NIO doesn't support CP850 in my environment; only plain java.io does. That's the end of the story regarding NIO usage, and it's even more frustrating because reading in 500 MB of flat-file data could use every performance boost it can get, but OK.
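Falling back to plain java.io, the read loop ends up looking roughly like this. A sketch under assumptions: the file name is made up, and checking Charset.isSupported up front avoids a runtime UnsupportedEncodingException on JREs that lack the mapping:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class FlatFileReader {
    public static void main(String[] args) throws Exception {
        String file = "postdata.txt"; // hypothetical name for the flat file from the CD

        // Fail early if this JRE does not know the CP850 mapping at all.
        if (!Charset.isSupported("Cp850")) {
            System.err.println("Cp850 not available in this JRE");
            return;
        }

        // InputStreamReader does the byte-to-char decoding; a large buffer
        // helps a little when churning through hundreds of megabytes.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "Cp850"),
                64 * 1024);
        String line;
        while ((line = in.readLine()) != null) {
            // process the record; umlauts like öäü now decode correctly
        }
        in.close();
    }
}
```

The buffer size and the early capability check are just defensive choices, not anything Deutsche Post prescribes.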
It seems they haven't changed their way of distributing data since the beginning of computing. I recently heard that they offer an alternative way of obtaining the data, perhaps via FTP, and perhaps they can provide the files in a different charset via that route. Let's see. Dealing with encoding issues is always a pleasure, because it's never quick to solve and always involves checking charset tables on some obscure sites on the internet.