gworld - an exploration of a looney's world....: November 2005

Wednesday, November 09, 2005

XML Python and Characters...

A little while ago Python moved from having only 8-bit strings, which were treated as byte arrays, to also supporting unicode strings. Personally I think if you're going to move, you have to decide on a direction and move.

On a more interesting note, Python's xml.dom.minidom provides support for parsing XML. When parsing a utf-8 encoded string it converts the Text and CDATA nodes into Python unicode strings. Nice, if its routines can guess information about the source correctly.
Now imagine you have a CDATA node with the contents "\r\r\n" (using the C programming language representations of the carriage return and line feed characters). A person would usually expect the XML parser to give a unicode string representation of "\r\r\n". This is not the case. In fact the character string becomes "\n\n". So if you're thinking of reliably extracting textual data byte-for-byte from an XML document in Python, I can only recommend staying away from Python's minidom (Python 2.3.5 under win32).
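
To make the behaviour above concrete, here is a minimal sketch. It is written for a current Python 3 rather than the 2.3.5 I was using, and the helper name root_text and the sample documents are made up for illustration. For what it's worth, turning "\r\n" and a lone "\r" into "\n" matches the end-of-line handling described in the XML 1.0 specification, so other parsers may well do the same. The second sample assumes you control whatever writes the XML and can emit character references instead of raw carriage returns; references are not expanded inside CDATA, so it uses an ordinary text node.

```python
from xml.dom import minidom


def root_text(xml_bytes):
    """Return the concatenated Text/CDATA content of the document's root element."""
    doc = minidom.parseString(xml_bytes)
    return "".join(
        node.data
        for node in doc.documentElement.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    )


# Raw carriage returns inside CDATA pass through the parser's line-ending handling.
cdata_doc = b"<root><![CDATA[\r\r\n]]></root>"

# Assumption: the producer can write character references instead of raw bytes.
# Character references are not recognised inside CDATA, so this is plain text.
charref_doc = b"<root>&#13;&#13;&#10;</root>"

if __name__ == "__main__":
    # repr() makes the control characters visible when printed.
    print(repr(root_text(cdata_doc)))
    print(repr(root_text(charref_doc)))
```

Comparing the two printed values shows whether the carriage returns survived the trip through the parser.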

Saturday, November 05, 2005

Extracting Raw XHTML from an XML document...

One of the cited reasons that a person might want to use XHTML, or XML-safe HTML, is the simple extraction of a document or text fragment from within another, ready for display. While this is a good idea, it may fall down in a number of places.

Yesterday I was misassigned a bug. Not my code, and we haven't started non-ownership fixing yet. Anyway, the bug was a simple highlighting bug. The letters were all squished together, e.g. "tag1tag2".

Now the way this text looked in the XML was "<containingtag>...tag1 tag2..</containingtag>". So why has the extraction using an XML DOM parser failed?

System.Xml.XmlDocument (Microsoft .NET).
An XML document doesn't need to treat whitespace between tags as important and by default it shouldn't. Hence the implementation of XmlDocument will eat the whitespace gap between tags.
So if we wish to use .InnerXml to get our snippet in a preserved state we'll need to do the following.

Set the document's PreserveWhitespace property to true before loading the XML. This will allow you to get the inner XML from tags and preserve any whitespace you've placed between your HTML tags.
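
The effect is easy to imitate outside .NET. The sketch below is Python, not the XmlDocument API: the function inner_xml, its preserve_whitespace flag and the <b> markup in the sample are all hypothetical stand-ins, and minidom itself keeps the whitespace, so the dropping is done by hand. It only shows why losing a whitespace-only text node between two tags squashes the rendered text together.

```python
from xml.dom import minidom


def inner_xml(xml_bytes, preserve_whitespace):
    """Serialise the children of the root element, optionally dropping
    text nodes that contain nothing but whitespace."""
    doc = minidom.parseString(xml_bytes)
    parts = []
    for node in doc.documentElement.childNodes:
        is_blank_text = node.nodeType == node.TEXT_NODE and not node.data.strip()
        if is_blank_text and not preserve_whitespace:
            continue  # imitate a parser that treats this gap as insignificant
        parts.append(node.toxml())
    return "".join(parts)


# Hypothetical sample: two inline elements separated by a single space.
sample = b"<containingtag><b>tag1</b> <b>tag2</b></containingtag>"

if __name__ == "__main__":
    print(inner_xml(sample, preserve_whitespace=True))
    print(inner_xml(sample, preserve_whitespace=False))
```

With the flag off, the space between the two elements is gone from the extracted snippet, which is the "tag1tag2" symptom from the bug report.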

Friday, November 04, 2005

Nostalgia

I was having a little bit of nostalgia about my old housemate ginnly (Virginia) and, more importantly, flight. I said that I'd go for my pilot's licence after I got my motorcycle licence.
Well I managed to get my bike licence a little while back, and I'm now still on my L's and have had two accidents so far. So this here is a little bit of a kick in my pants to get me up and going, so that I'll start getting lessons again. I can only assume that she's still getting lessons down in Melbourne.
So after I'm not so horrendously broke it should be flying lessons once again.

Creation

With the creation of anything, in this case a blog, I feel that there should be a purpose or a statement. While this is a blog that allows me to write about whatever I feel, I do intend to use it as a place to write technical information about programming and the techniques around it. I'd also like to include information about myself, but I will endeavour to keep these two areas quite separate.
