I’ve been thinking about signing XML nodes. The existing mechanisms are either really complex (XML-DSig) or over-verbose (Magic Sig). A simpler mechanism could be useful in RSS/ATOM feeds, XMPP, and other XML-based communication formats. The purpose of this proposal is to provide a lightweight signing (and, optionally, encryption) mechanism for embedding inside XML nodes without inventing any new XML namespaces, elements, or attributes, without inventing a new envelope format for the signature data, and without suggesting a new way of transmitting octet streams in a text-safe way.
Normalization
In order to preserve the form of the XML being signed, an exact textual representation of the XML tree to be signed must be included in the signature packet (“opaque signing”). This is similar to the strategy employed by Magic Sig.
It is recommended that the fragment be encoded as a valid standalone XML document, so that parsers can easily feed the unwrapped content to an XML parser and use the tree that results, without having to graft the text back into the original XML document for parsing.
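As a rough illustration of the standalone-document recommendation, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample feed, the ATOM_NS constant, and the function name entry_as_standalone_document are assumptions made for this example, not part of the proposal. Re-serializing may change whitespace and quoting, so the bytes produced here (not the original feed text) are what would go into the signature packet.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: we are signing an Atom entry.
ATOM_NS = "http://www.w3.org/2005/Atom"

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Test Feed</title>
  <entry>
    <title>Test Entry</title>
    <link rel="alternate" type="text/html" href="http://example.com/item1" />
  </entry>
</feed>"""


def entry_as_standalone_document(feed_xml: str) -> bytes:
    """Serialize the first Atom entry as a self-contained XML document."""
    # Serialize Atom as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", ATOM_NS)
    root = ET.fromstring(feed_xml)
    entry = root.find(f"{{{ATOM_NS}}}entry")
    if entry is None:
        raise ValueError("feed has no entry element")
    # Drop the whitespace that followed the entry inside the parent feed.
    entry.tail = None
    # xml_declaration=True needs Python 3.8 or later.
    return ET.tostring(entry, encoding="utf-8", xml_declaration=True)


standalone = entry_as_standalone_document(FEED)
# The result parses on its own, with no reference to the original feed.
reparsed = ET.fromstring(standalone)
```

Because the namespace declaration is carried on the entry element itself, a consumer can parse the unwrapped bytes directly without grafting them back into the feed.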
Envelope format
Rather than inventing a new envelope to mark up what algorithms were used to generate the signature, I suggest using the standard OpenPGP packet format from RFC4880. This standard is well-deployed for use in Email and other cryptosystems, and there are implementations, or partial implementations, in many languages, including PHP.
Inclusion in an XML node
An opaquely signed XML fragment is just an alternative representation of the node it wraps. This relationship is well modelled by the ATOM link element (namespace http://www.w3.org/2005/Atom) with the rel attribute set to alternate.
RFC3156 defines an Internet media type for encrypted and/or signed OpenPGP data as application/pgp-encrypted. This makes an appropriate value for the type attribute.
Text-safe encoding of octets
Protocols may wish to include the OpenPGP packet directly in the XML document, instead of linking to an external resource. In fact, this is probably the normal case. RFC2397 defines a useful mechanism for encoding arbitrary octet streams (such as those used in the OpenPGP binary packet format) as URIs for use anywhere a URI is expected, such as the link element’s href attribute. The media type included in the data URI should be application/pgp-encrypted.
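To make the data URI step concrete, here is a small Python sketch of the base64 form of RFC2397 URIs. The helper names to_data_uri and from_data_uri are hypothetical, and the sketch only handles the base64 variant; it does not produce or verify OpenPGP packets, it just wraps and unwraps octets.

```python
import base64
from typing import Tuple

MEDIA_TYPE = "application/pgp-encrypted"


def to_data_uri(packet: bytes, media_type: str = MEDIA_TYPE) -> str:
    """Wrap raw octets in a base64 data: URI."""
    encoded = base64.b64encode(packet).decode("ascii")
    return f"data:{media_type};base64,{encoded}"


def from_data_uri(uri: str) -> Tuple[str, bytes]:
    """Return (media type, octets) from a base64 data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data: URI has no comma separator")
    if not header.endswith(";base64"):
        raise ValueError("only base64 data URIs are handled in this sketch")
    media_type = header[: -len(";base64")]
    # Tolerate payloads whose trailing '=' padding was stripped.
    payload += "=" * (-len(payload) % 4)
    return media_type, base64.b64decode(payload)


# Round trip with arbitrary octets standing in for an OpenPGP packet.
octets = bytes(range(256))
uri = to_data_uri(octets)
media_type, decoded = from_data_uri(uri)
```

The resulting string can be placed in an href attribute as-is, since the standard base64 alphabet contains no characters that need escaping in XML attribute values.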
Example
Below is an ATOM fragment demonstrating this recommendation:
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test Feed</title>
<entry>
<title>Test Entry</title>
<link rel="alternate" type="text/html" href="http://example.com/item1" />
<published>2010-03-26T06:47:47+03:00</published>
<link rel="alternate" type="application/pgp-encrypted" href="data:application/pgp-encrypted;base64,owGbwMvMwMF4UUZT8FzgnHuMa/4msZUY6VXk5nhvuC9pYw9kKJSlFhVn5ufZKhnqGSgppOYl56dk5qXbKoWGuOlaKNnbcdmk5pUUVSoA1eYV2ypllJQUWOnrl5eX65Ub6+UXpesbGRiY6juW5OcqAdWWZJbkpNqFpBaXKLiCtNnoQ0S4bHIy87IVilJzbJUSc0pSi/ISS1KVFEoqC1JtlUpSK0r0M0pyc5QUMopS0+CWpFYk5hbkpOol5+fqZ5ak5hoqKegDTSooTcrJLM5ITbEzMjA00DUw1jUyCzEwszIxByJtA2MrAwMbfYQiLht9sA/suDqZZFgYGDkY2FiZQP5n4OIUgAXN+W/8v5h/vXu4cBXbrfUVT0yXTKsI+h1Qw2/zdYZ9h/yJ+/r5L7U1xc/sWMTh4Du1MXjKWsMHP2KZ0ooy7ZVFz0ps011ZaZDUl78y5JbdCbeMNcEcjb0tS1yEu7hfZX6Wsi+qeMDRZO1/1+Lt8c1fqloOLHSqe6YyX8dsp9HmtZlXkxdZVT/67Jcnxig59Rj/Gv0Nr0RPdSg68a/r3vKvV3tavvvyXP6wtWu3NHJrFR25euzI/VdnVMzjuDYt47gjxnm956jU72cvfXj7Dk79+yqfkyPCxfyo/ZFdTJtfK1ncYgur6Dqp47r4EcfZ2pDNwk4n5vYwOfjuWM09Ye6ij+nnbi2Zn2ld/zhx4o+Z2XoF6Twucayxa80urHWeMF0nw/mq2fLFhuv1PWaXRtoHHr9qMc9e+tjdD5KhxQV3Jr49d2YZ2+WjD49UfqtLYVxfxzy3iiG4YWvZHfl2g/yPtx3M+rqVuVbM2Vpzq9q76OytfQsjm769Ely2PGDTmg8ak2frnfL9/49VPKPU8vj0Y0diN213uNkrfeH15TbjM09el/2rCF5v/Uryd3XXB8OfbedesyRJH1s7dWPB04JFvA9Zp6y7IfI2VDo8VHiHVpmqR+aERRt/7Dy1NtmV+48KH8/l7ebCr4KNb7v3SSqfdWDey7lkQ+Nzwdfb1vOlTzXU1zyx0kVX3v2t9h+f1T1px+KuX7rw8gOb17St+69/kvkKAA" />
</entry>
</feed>
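On the consuming side, a reader of such a feed needs to find the signed alternate and recover the octets. The Python sketch below shows one way this might look; the function name find_signed_alternates and the placeholder payload are assumptions for illustration, and passing the recovered octets to an OpenPGP implementation for verification is left out.

```python
import base64
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
PGP_TYPE = "application/pgp-encrypted"
DATA_PREFIX = "data:" + PGP_TYPE + ";base64,"

# Placeholder octets; a real feed would carry an OpenPGP packet here.
PLACEHOLDER = b"placeholder bytes, not a real OpenPGP packet"
ENCODED = base64.b64encode(PLACEHOLDER).decode("ascii")

SAMPLE_FEED = f"""<feed xmlns="http://www.w3.org/2005/Atom">
<title>Test Feed</title>
<entry>
<title>Test Entry</title>
<link rel="alternate" type="text/html" href="http://example.com/item1" />
<link rel="alternate" type="{PGP_TYPE}" href="{DATA_PREFIX}{ENCODED}" />
</entry>
</feed>"""


def find_signed_alternates(feed_xml: str) -> list:
    """Return the decoded octets of every inline signed alternate link."""
    root = ET.fromstring(feed_xml)
    packets = []
    for entry in root.iter(ATOM + "entry"):
        for link in entry.findall(ATOM + "link"):
            if link.get("rel") != "alternate" or link.get("type") != PGP_TYPE:
                continue
            href = link.get("href", "")
            if not href.startswith(DATA_PREFIX):
                continue  # external resource; fetching it is out of scope
            payload = href[len(DATA_PREFIX):]
            # Tolerate payloads whose trailing '=' padding was stripped.
            payload += "=" * (-len(payload) % 4)
            packets.append(base64.b64decode(payload))
    return packets


packets = find_signed_alternates(SAMPLE_FEED)
```

Consumers that do not understand the application/pgp-encrypted type simply see one more alternate link and can ignore it.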
In response to the popular confusion about XML well-formedness and a recent nudging by Greg, I have upgraded Blogger Recent Comments. People who have been there before will note that there is now one less instruction: XML well-formedness is no longer necessary! I have tested this with a Blogger template on which I purposely broke well-formedness and the comments still came through fine! Originally introduced on this blog, Blogger Recent Comments is a Ning app that can automatically generate RSS, JavaScript, and JSON feeds of all comments on your blog. Setup is now just three easy steps!
XHTML validity is a buzzword around the Internet, but many people generally agree that it is not all that important. It has its advantages, but it is not the end of the world if you can’t quite get it. XML well-formedness, however, is very important. Why? Because it makes server-side hackery much easier. That may not be the only reason, but it is an important one. Some people have mastered the art of screen-scraping with RegExps, but I and others like me have never quite mastered that often-complicated technique. Instead, it is much easier to parse the webpage as XML and pull out the data that way. This works especially well when the page is known to conform to some standard (as in the code addition for Blogger Recent Comments).
While some leniency can be built in, here are some basic guidelines for keeping your pages well-formed and making our job that much easier:
- XHTML empty tags: some tags, such as <br>, <link …>, and others used to be written in HTML as you see them there. This breaks XML well-formedness. Instead, one should use <br />, <link … /> and the like. (Note to advanced users: this can be partially overcome using a RegExp line similar to $XMLdata = preg_replace('/<(img|meta|link|hr|br)([^<>]*?)([\/]?)>/i', '<$1$2 />', $XMLdata); )
- Escaping out ampersands: many URLs contain the '&' character, and sometimes this character is used in content as well. If this character is left unescaped it breaks XML well-formedness. Use '&amp;' instead. (Note to advanced users: this can be mostly overcome using a RegExp line similar to $XMLdata = preg_replace('/&([^;]{10})/i', '&amp;$1', $XMLdata); )
- Escaping scripts: JavaScript code will often contain characters that must be escaped out in XML, but which cannot be escaped out if the script is to work. To overcome this, add '//<![CDATA[' after every <script> tag and '//]]>' before every </script> tag.
- Closing tags: some tags, such as <p>, are often inserted by web designers without a closing tag. Instead of '<p>text<p>more text' use '<p>text</p><p>more text</p>'. Note that XML is case-sensitive, so if you open a section with, say, <head> you must end it with </head>, not </HEAD>.
- Quoting attributes: <p class="1"> not <p class=1>, etc. Quotation marks always go around attribute values, no matter what.
- Non-tag < >: if you reference a Blogger template tag (such as <$BlogID$>) or for some other reason need to include a < or > character in content, you must escape it out with &lt; or &gt;, respectively.
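The first two guidelines are the ones that can be patched up mechanically. The Python sketch below is an illustration of that idea, not the code used by Blogger Recent Comments: the empty-tag pattern is adapted from the PHP line above (with a word boundary added), while the ampersand rule swaps the ten-character heuristic for a negative lookahead that skips anything already shaped like an entity. The names repair and is_well_formed are hypothetical. Like any regex repair, it is a heuristic and will miss cases such as a > inside an attribute value.

```python
import re
import xml.etree.ElementTree as ET

# Empty HTML tags, with or without an existing trailing slash.
EMPTY_TAG = re.compile(r"<(img|meta|link|hr|br)\b([^<>]*?)\s*/?>", re.IGNORECASE)
# An ampersand that does not start something shaped like &name; or &#123;
BARE_AMPERSAND = re.compile(r"&(?!#?\w+;)")


def repair(markup: str) -> str:
    """Self-close known empty tags and escape bare ampersands."""
    markup = EMPTY_TAG.sub(r"<\1\2 />", markup)
    return BARE_AMPERSAND.sub("&amp;", markup)


def is_well_formed(markup: str) -> bool:
    """Report whether the markup parses as XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


broken = '<div><p>Tom & Jerry<br><img src="a.png?x=1&y=2"></p></div>'
fixed = repair(broken)
# fixed == '<div><p>Tom &amp; Jerry<br /><img src="a.png?x=1&amp;y=2" /></p></div>'
```

Running is_well_formed before and after the repair gives a quick way to see whether the remaining guidelines (closing tags, quoting attributes, escaping scripts) still need attention by hand.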
A note about content: Blogger’s post form and comment form are not very good at checking XML well-formedness. Thus, if you want to maintain an (at least mostly) well-formed page, you must follow these rules in any code entered in these forms. For example, if you enter a < character in the Blogger post form, it is not escaped for you; you must actually enter &lt;, and the same goes for the comment form. This is sometimes annoying if you are trying to maintain full XML well-formedness, because a well-meaning commenter can sometimes mess up your well-formedness and you must go and edit their comment. This is not usually the biggest problem, however, since it is usually one of the first two problems, which can be overcome as noted. You can check for XML well-formedness without validating XHTML using this tool.
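The manual escaping described above can be done by a script before text is pasted into a form. A minimal Python sketch, assuming the text is plain content with no markup that should survive; the function name escape_for_post is hypothetical, while xml.sax.saxutils.escape is the standard library call that replaces &, < and > with their entities:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_for_post(text: str) -> str:
    """Escape &, < and > so the text is safe as XML character data."""
    return escape(text)


snippet = "if (a < b && c > d)"
escaped = escape_for_post(snippet)
# escaped == "if (a &lt; b &amp;&amp; c &gt; d)"

# Wrapped in an element, the escaped text parses back to the original.
round_trip = ET.fromstring("<p>" + escaped + "</p>").text
```

This only covers character data; text destined for an attribute value would also need its quotation marks handled.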
The list of XOXO Developer’s Resources has been updated to include the Outline Classes. Written in PHP4, but with compatibility built in for PHP5, this set of classes is designed to be able to parse and create XOXO, OPML, hAtom, JSON, and arbitrary XML documents and fragments. The classes are GPL'ed.