In my upcoming new job I have to write C# code on Windows. This was a choice of mine and the reasons for the choice are outside of the context of public disclosure. I am happy to say I haven't used a Microsoft product for much of anything for about 10 years or so. I write code for a living and my position remains that developing the sorts of plumbing code I write on Windows is like going to a gunfight with a knife. It's just the wrong tool.
Anyway, when I get some time at night I've been futzing about with a little program to add an XML element to Visual Studio project files. I have to use Visual Studio, it's doing something I don't like, and to tell it not to do that something I'd have to modify 57 project files using a GUI. Which is stupid in many dimensions. I'm OK with stupid at some level. Wildcards with Java generics, for example. But I digress.
These project files have an extension of .csproj. If you happen to be on Windows and have one of these lying around, open it up in some random editor. I used both Notepad and TextPad. Looks like normal XML, right? As you'd expect. I mean it's either XML or some obviously non-XML thing. Right? How else would you do it? One or the other. Either, or.
Now do something like this (warning: C# code ahead):
XmlDocument document = new XmlDocument();
document.LoadXml("SomeProject.csproj");
Watch in surprise as LoadXml() throws an exception telling you that the content is illegal at line 1, position 1. Look at the file in the editors. See XML? Yup. Go back and futz with the program. Try, e.g.,
document.Load(someStreamYouCreated);
Same exception. Be confused and frustrated.
In a fit of frustrated inspiration at 5:30 AM, fire up the machine, open CMD.EXE (truly an abortion of a CLI) and do this:
more SomeProject.csproj
Marvel at the 3 squiggly little characters at the beginning of the file. Indeed at line 1, position 1. WTF? In your program read past those 3 bytes and then do
document.Load(someStreamAfterReadingThreeBytes);
and see your program start to function.
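For the record, here is a minimal sketch of that workaround next to the simpler route of handing the file path straight to Load. The class and method names (ProjectFileLoader, LoadSkippingBom, LoadByPath) are made up for illustration, and the sample content is a stand-in rather than a real .csproj. As far as I can tell, XmlDocument.Load recognizes a UTF-8 byte order mark on its own when given a path or a raw stream, so the manual skip may not be needed at all.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class ProjectFileLoader
{
    // Manual approach: skip the three UTF-8 BOM bytes (EF BB BF) if present,
    // then hand the rest of the stream to the parser.
    public static XmlDocument LoadSkippingBom(string path)
    {
        using (FileStream stream = File.OpenRead(path))
        {
            byte[] head = new byte[3];
            int read = stream.Read(head, 0, head.Length);
            bool hasBom = read == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF;
            if (!hasBom)
            {
                stream.Position = 0; // no BOM: rewind so no real content is lost
            }

            XmlDocument document = new XmlDocument();
            document.Load(stream);
            return document;
        }
    }

    // Simpler approach: Load(string) takes a file name or URL and lets the
    // parser work out the encoding by itself.
    public static XmlDocument LoadByPath(string path)
    {
        XmlDocument document = new XmlDocument();
        document.Load(path);
        return document;
    }

    public static void Main()
    {
        // Hypothetical stand-in file, written with a UTF-8 BOM.
        string path = Path.Combine(Path.GetTempPath(), "SomeProject.csproj");
        File.WriteAllText(path, "<Project><PropertyGroup /></Project>", new UTF8Encoding(true));

        Console.WriteLine(LoadSkippingBom(path).DocumentElement.Name);
        Console.WriteLine(LoadByPath(path).DocumentElement.Name);
    }
}
```

Both calls should end up with the same document. The second one is less code and does not assume the file is UTF-8.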
Of the many reasons I truly detest Microsoft this is probably one of the largest. What absolute arrogance, in my mind, to jam 3 bytes at the front of this XML-like file. Because it's not XML, is it? So why make it look like it is? Is this an example of the vaunted Microsoft "innovation"? There may be some better way of loading these files. Some Microsoft-approved way like VisualStudioProjectFileParser or whatever. I didn't find one when looking casually and, honestly, I don't think I should have to.
Hey, fellas! There's a whole world out here. You're not special or cool anymore. At all. Believe it.
C# is a rocking language though. I think I might prefer it to Java.
Er... that's just the byte order mark, which isn't uncommon for specifying UTF-16 XML files. As for why it doesn't load via XmlDocument, I have no clue.
It is probably the Unicode BOM stuff that identifies the file as UTF-8. The first 3 bytes were probably EF BB BF. I haven't used the XML load stuff yet but I wonder if there is a method or parameter for Unicode files vs ANSI/ASCII.
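A quick way to check that guess is to look at the first few bytes of the file. This is only a sketch: BomSniffer and DescribeBom are hypothetical names, and it only knows the UTF-8 and UTF-16 signatures.

```csharp
using System;

public static class BomSniffer
{
    // Returns a label for the byte order mark at the start of the buffer.
    // Only UTF-8 and UTF-16 are checked; a UTF-32 little-endian BOM
    // (FF FE 00 00) would be reported as UTF-16 little-endian here.
    public static string DescribeBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        {
            return "UTF-8";
        }
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        {
            return "UTF-16 little-endian";
        }
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        {
            return "UTF-16 big-endian";
        }
        return "none";
    }

    public static void Main()
    {
        // EF BB BF followed by '<', as a BOM-prefixed UTF-8 XML file would start.
        Console.WriteLine(DescribeBom(new byte[] { 0xEF, 0xBB, 0xBF, 0x3C }));
        // Plain "<?" with no signature in front of it.
        Console.WriteLine(DescribeBom(new byte[] { 0x3C, 0x3F }));
    }
}
```

If the first call's pattern matches your file, the three squiggly characters are the UTF-8 signature and not part of the markup.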
> Er... that's just the byte order mark, which isn't uncommon for specifying UTF-16 XML files. As for why it doesn't load via XmlDocument, I have no clue.
Doesn't XML already have a place to specify the encoding of the file?
> Doesn't XML already have a place to specify the encoding of the file?
>
> http://en.wikipedia.org/wiki/XML#International_use
Nope, you have to be able to read that first 'encoding' line. If that line is UTF-16 you need a byte order mark for correct processing. So for multi-byte character sets you need some Byte Order Mark. You only need a BOM in UTF-8 to be able to distinguish this stream from UTF-16.
> > Doesn't XML already have a place to specify the encoding of the file?
> >
> > http://en.wikipedia.org/wiki/XML#International_use
> Nope, you have to be able to read that first 'encoding' line.
> If that line is UTF-16 you need a byte order mark for correct processing.
I've never needed it. I find your argument hard to believe for that reason.
"Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described in ISO/IEC 10646 [ISO/IEC 10646] or Unicode [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."
> "Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described in ISO/IEC 10646 [ISO/IEC 10646] or Unicode [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."
That would mean the .NET parsing API being used has a bug, right?
> That would mean the .NET parsing API being used has a bug, right?
Correct. The Crimson parser (in Java) behaves the same way, though.
Besides, there are good reasons to use the BOM, even if you're using UTF-8 only: some editors (IntelliJ IDEA, for example) won't understand the <?xml .. UTF8> declaration and will screw up your file unless the BOM is present.
Then again, some CVS interfaces will ignore the BOM and not show it in the history. (maybe it's in the CVS server, too...)
> That would mean the .NET parsing API being used has a bug, right?
No. It means the programmer didn't read the documentation and just assumed he knew what the method was designed to do. The ReadXML function reads from a string and converts it to XML:
public virtual void LoadXml( string xml )
Since the first "character" in the file isn't a valid string character, an exception is thrown (as it should be).
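A rough sketch of the distinction, for anyone who wants to try it. The names LoadVersusLoadXml and ParsesAsMarkup and the temp file are made up for the example. As far as I can tell, LoadXml never opens any file: it parses the string it is given, so in the original snippet it is the text "SomeProject.csproj" itself that fails at line 1, position 1.

```csharp
using System;
using System.IO;
using System.Xml;

public static class LoadVersusLoadXml
{
    // LoadXml(string) treats its argument as XML markup, not as a file name.
    public static bool ParsesAsMarkup(string text)
    {
        try
        {
            new XmlDocument().LoadXml(text);
            return true;
        }
        catch (XmlException e)
        {
            Console.WriteLine("LoadXml failed at line " + e.LineNumber + ", position " + e.LinePosition);
            return false;
        }
    }

    public static void Main()
    {
        // A file name is not markup, so this reports a failure.
        Console.WriteLine(ParsesAsMarkup("SomeProject.csproj"));

        // Actual markup parses fine.
        Console.WriteLine(ParsesAsMarkup("<Project />"));

        // To parse a file, either pass its path to Load, or read the text
        // yourself and pass the contents (not the name) to LoadXml.
        string path = Path.Combine(Path.GetTempPath(), "LoadVersusLoadXml.csproj");
        File.WriteAllText(path, "<Project />");

        XmlDocument fromPath = new XmlDocument();
        fromPath.Load(path);

        XmlDocument fromText = new XmlDocument();
        fromText.LoadXml(File.ReadAllText(path));

        Console.WriteLine(fromPath.DocumentElement.Name == fromText.DocumentElement.Name);
    }
}
```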
(This part isn't addressed to James specifically) I think a general principle that all of us as developers should keep in mind is that there is a danger in using tools or languages that we are prejudiced against. We can waste a lot of time blaming our own bugs on others.
> > That would mean the .NET parsing API being used has a bug, right?
>
> No. It means the programmer didn't read the documentation and just assumed he knew what the method was designed to do. The ReadXML function reads from a string and converts it to XML:
>
> public virtual void LoadXml( string xml )
>
> Since the first "character" in the file isn't a valid string character, an exception is thrown (as it should be).
>
> (This part isn't addressed to James specifically) I think a general principle that all of us as developers should keep in mind is that there is a danger in using tools or languages that we are prejudiced against. We can waste a lot of time blaming our own bugs on others.
OOPS, I wrote "ReadXML" instead of "LoadXML". That damned Microsoft!