Wednesday, November 09, 2005

XML, Python and Characters...

A little while ago Python moved from having only 8-bit strings, which were treated as byte arrays, to strings with Unicode support. Personally, I think that if you're going to move, you have to decide on a direction and move.

On a more interesting note, Python's xml.dom.minidom provides support for parsing XML. When parsing a UTF-8 encoded string, it converts the Text and CDATA nodes into Python unicode strings. Nice, if its routines can guess information about the source correctly.
Now imagine you have a CDATA node with the contents "\r\r\n" (using the C programming language representations of the carriage return and line feed characters). A person would usually expect the XML parser to give a unicode string representation of "\r\r\n". This is not the case. In fact, the character string becomes "\n\n". So if you're thinking of reliably extracting textual data from an XML document in Python, I can only recommend staying away from Python's minidom (Python 2.3.5 under win32).
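
For anyone who wants to see this for themselves, here is a minimal sketch of the behaviour described above. It assumes a current Python 3 interpreter rather than the 2.3.5 build I was using, and the helper name text_of and the element name a are made up for illustration. The second call writes the same three characters as numeric character references in ordinary text (they are not interpreted inside a CDATA section), which should survive parsing because the line-end translation described in XML 1.0 section 2.11 applies to literal characters only.

```python
from xml.dom.minidom import parseString


def text_of(xml_source):
    """Return the concatenated character data of the root element's children.

    Hypothetical helper for this demo; it joins every child node's data so the
    result does not depend on how the parser chunks the text.
    """
    root = parseString(xml_source).documentElement
    return "".join(node.data for node in root.childNodes)


# Literal CR CR LF inside a CDATA section: the line ends are translated.
cdata_result = text_of("<a><![CDATA[\r\r\n]]></a>")

# The same characters written as character references in an ordinary text node.
charref_result = text_of("<a>&#13;&#13;&#10;</a>")

print(repr(cdata_result))    # '\n\n'
print(repr(charref_result))  # '\r\r\n'
```

So if you control how the document is written, escaping carriage returns as &#13; in a normal text node is one possible way to get them through intact; if you only control the reading side, the original line ends are gone before minidom hands you the string.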