Singpolyma

Technical Blog

The Importance of XML Well-formedness


XHTML validity is a buzzword around the Internet, but many people agree that it is not all that important. It has its advantages, but it is not the end of the world if your pages don’t quite validate. XML well-formedness, however, is very important. Why? Because it makes server-side hackery much easier. That may not be the only reason, but it is an important one. Some people have mastered the art of screen-scraping with RegExps, but I and others like me have never quite mastered that often-complicated technique. Instead, it is much easier to parse the webpage as XML and pull out the data that way. This works especially well when the page is known to conform to some standard (as in the code addition for Blogger Recent Comments).
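To make the parse-and-extract idea concrete, here is a minimal sketch in Python rather than PHP. The page fragment, the "comment" class name, and the extract_comments helper are all invented for illustration; they are not the markup Blogger actually emits or the code used for Blogger Recent Comments.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed page fragment; a real template will look different.
PAGE = """<html>
  <body>
    <div class="comment"><a href="http://example.com/?a=1&amp;b=2">Ariel</a> said: Thanks!</div>
    <div class="comment"><a href="http://example.org/">Bob</a> said: Nice post.<br /></div>
    <div class="sidebar"><a href="http://example.net/">Not a comment</a></div>
  </body>
</html>"""


def extract_comments(xml_text):
    """Return (author, url) pairs for every <div class="comment"> in the page."""
    # fromstring raises ET.ParseError if the page is not well-formed XML.
    root = ET.fromstring(xml_text)
    results = []
    for div in root.iter("div"):
        if div.get("class") != "comment":
            continue
        link = div.find("a")
        if link is not None:
            # The parser has already turned &amp; back into & for us.
            results.append((link.text, link.get("href")))
    return results


if __name__ == "__main__":
    for author, url in extract_comments(PAGE):
        print(author, url)
```

No pattern matching is involved: once the page parses, the data is reached by walking elements and attributes. The catch is that a single stray < or & anywhere in the page makes the parse fail outright.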

While some leniency can be built into the parsing code, here are some basic guidelines for keeping your pages well-formed and making our job that much easier:

  1. XHTML empty tags: some tags, such as <br>, <link …>, and others, used to be written in HTML as you see them there. This breaks XML well-formedness. Instead, one should use <br />, <link … /> and the like. (Note to advanced users: this can be partially overcome using a RegExp line similar to $XMLdata = preg_replace('/<(img|meta|link|hr|br)([^<>]*?)([\/]?)>/i', '<$1$2 />', $XMLdata); )
  2. Escaping ampersands: many URLs contain the ‘&’ character, and sometimes this character is used in content as well. If this character is left unescaped, it breaks XML well-formedness. Use ‘&amp;’ instead. (Note to advanced users: this can be mostly overcome using a RegExp line similar to $XMLdata = preg_replace('/&([^;]{10})/i', '&amp;$1', $XMLdata); )
  3. Escaping scripts: JavaScript code will often contain characters that must be escaped in XML, but which cannot be escaped if the script is to work. To overcome this, add ‘//<![CDATA[’ after every <script> tag and ‘//]]>’ before every </script> tag.
  4. Closing tags: some tags, such as <p>, are often inserted by web designers without a closing tag. Instead of ‘<p>text<p>more text’ use ‘<p>text</p><p>more text</p>’. Note that XML is case-sensitive, so if you open a section with, say, <head>, you must end it with </head>, not </HEAD>.
  5. Quoting attributes: <p class="1"> not <p class=1>, etc. Quotation marks always go around attribute values, no matter what.
  6. Non-tag < >: if you reference a Blogger template tag (such as <$BlogID$>) or for some other reason need to include a < or > character in content, you must escape it with &lt; or &gt;, respectively.
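For readers who want to try the guidelines above, here is a rough Python sketch of the two regex repairs from items 1 and 2, plus a tiny well-formedness check to exercise the other rules. It is not a port of the PHP lines: the ampersand pattern uses a lookahead instead of the ten-character heuristic, and the function names and sample strings are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Same tag list as the RegExp in item 1; \b keeps longer tag names from matching.
EMPTY_TAG = re.compile(r"<(img|meta|link|hr|br)\b([^<>]*?)\s*/?>", re.IGNORECASE)

# An ampersand that does not already begin something like &amp; or &#160;.
BARE_AMP = re.compile(r"&(?!#?\w+;)")


def close_empty_tags(text):
    """Rewrite <br> or <link ...> as <br /> or <link ... />."""
    return EMPTY_TAG.sub(r"<\1\2 />", text)


def escape_bare_ampersands(text):
    """Turn unescaped & into &amp; while leaving existing entities alone."""
    return BARE_AMP.sub("&amp;", text)


def is_well_formed(text):
    """True if the text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical snippet that breaks rules 1 and 2.
BROKEN = '<p>Fish & chips<br><a href="http://example.com/?a=1&b=2">menu</a></p>'

if __name__ == "__main__":
    fixed = escape_bare_ampersands(close_empty_tags(BROKEN))
    print(is_well_formed(BROKEN), is_well_formed(fixed))
    print(fixed)
    # Rules 3 to 6 cannot be patched this easily, but they can be checked:
    for sample in [
        "<script>//<![CDATA[\nif (a < b && c) {}\n//]]></script>",  # rule 3, ok
        "<script>if (a < b && c) {}</script>",                      # rule 3, broken
        "<div><p>text<p>more text</div>",                           # rule 4, broken
        "<head></HEAD>",                                            # rule 4, broken
        "<p class=1>text</p>",                                      # rule 5, broken
        "<p>1 &lt; 2</p>",                                          # rule 6, ok
    ]:
        print(is_well_formed(sample), repr(sample))
```

The lookahead only checks that something entity-shaped follows the ampersand. A named entity that XML does not define, such as &nbsp;, is left alone and will still fail to parse.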

A note about content: Blogger’s post form and comment form are not very good at checking XML well-formedness. Thus, if you want to maintain an (at least mostly) well-formed page, you must follow these rules in any code entered in these forms. For example, if you enter a < character in the Blogger post form, it is not escaped for you; you must actually enter &lt;, and the same goes for the comment form. This is sometimes annoying if you are trying to maintain full XML well-formedness, because a well-meaning commenter can break it and you must then go and edit their comment. This is not usually the biggest problem, however, since the breakage is usually one of the first two problems in the list, which can be overcome as noted. You can check for XML well-formedness without validating XHTML using this tool.
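If you would rather check well-formedness locally, a few lines of Python can do it. The sketch below is only an assumption about how one might go about it and is unrelated to the tool mentioned above. It reports where the parser gave up, and shows the kind of escaping the post and comment forms do not do for you. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def check_well_formed(text):
    """Return None if the text parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # where the parser stopped
        return (line, column, str(err))
    return None


def wrap_comment(raw_comment):
    """Escape &, < and > in visitor-supplied text before putting it in a page."""
    return "<div>" + escape(raw_comment) + "</div>"


if __name__ == "__main__":
    # A raw comment pasted straight into the page breaks it...
    print(check_well_formed("<div>if a < b & c then stop</div>"))
    # ...while the escaped version parses cleanly (prints None).
    print(check_well_formed(wrap_comment("if a < b & c then stop")))
```

This checks only that the markup parses. It says nothing about whether the page is valid XHTML.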


One Response

Ariel

Thanks for the XML tips. Now I feel compelled to go take a long, hard look at my template.
