Saturday, November 05, 2005

Extracting Raw XHTML from an XML document...

One of the commonly cited reasons for using XHTML, or XML-safe HTML, is that you can extract a document or text fragment from inside another XML document and have it ready for display. While this is a good idea, it can fall down in a number of places.

Yesterday I was misassigned a bug. Not my code, and we haven't started non-ownership fixing yet. Anyway, the bug was a simple highlighting bug: the highlighted text was all squished together, e.g. "<em>tag1</em><em>tag2</em>"

Now, the way this text looked in the XML was "<containingtag>...<em>tag1</em> <em>tag2</em>..</containingtag>", with a space between the two <em> elements. So why did the extraction using an XML DOM parser lose that space?

System.Xml.XmlDocument (Microsoft .NET).
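An XML parser doesn't have to treat whitespace-only text between tags as important, and by default XmlDocument doesn't. Its PreserveWhitespace property defaults to false, so the whitespace gap between two tags is eaten when the document is loaded. Whitespace that sits next to other text inside a node is not affected.
So if we wish to use .InnerXml to get our snippet back exactly as it was written, we'll need to do the following.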

Set the PreserveWhitespace property on the document to true before loading the XML. This will allow you to get the inner XML from tags and preserve any whitespace you've placed between your HTML tags.
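
Here is a minimal sketch of the difference, assuming a cut-down version of the XML from the bug with the surrounding text left out. The class name and the ExtractInnerXml helper are made up for illustration; XmlDocument, PreserveWhitespace, LoadXml, DocumentElement and InnerXml are the real System.Xml members.

```csharp
using System;
using System.Xml;

class PreserveWhitespaceDemo
{
    // Hypothetical helper: loads the XML and returns the inner XML of the root element.
    static string ExtractInnerXml(string xml, bool preserveWhitespace)
    {
        XmlDocument doc = new XmlDocument();
        // Must be set before LoadXml; changing it afterwards does not
        // bring back whitespace nodes that were already discarded.
        doc.PreserveWhitespace = preserveWhitespace;
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerXml;
    }

    static void Main()
    {
        // Simplified stand-in for the real document from the bug.
        const string xml = "<containingtag><em>tag1</em> <em>tag2</em></containingtag>";

        // Default behaviour: the whitespace-only node between the tags is dropped.
        Console.WriteLine(ExtractInnerXml(xml, false)); // <em>tag1</em><em>tag2</em>

        // With PreserveWhitespace set, the space survives the round trip.
        Console.WriteLine(ExtractInnerXml(xml, true));  // <em>tag1</em> <em>tag2</em>
    }
}
```

The important detail is the ordering: the property only affects how the document is loaded, so it has to be set before the call to Load or LoadXml.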