[wp-trac] [WordPress Trac] #9992: Atom feed titles are CDATA'ed and XML encoded.
WordPress Trac
wp-trac at lists.automattic.com
Mon Mar 8 19:57:40 UTC 2010
#9992: Atom feed titles are CDATA'ed and XML encoded.
------------------------------------------------+---------------------------
Reporter: pm24601 | Owner: josephscott
Type: defect (bug) | Status: reopened
Priority: normal | Milestone: 3.0
Component: Feeds | Version: 2.7.1
Severity: normal | Resolution:
Keywords: has-patch tested reporter-feedback |
------------------------------------------------+---------------------------
Changes (by jarrettc):
* cc: jarrettc (added)
Comment:
The XML is wrong, and this ''is'' a real problem. I'll explain the reasons
for both assertions.
The problem is that the XML is escaped twice. Entity-encoding is one
method for escaping XML's control characters. CDATA is another method.
Either one can be used. But if you use them both, you're escaping twice.
Take the following string as an example:
Johnson & Johnson
If I entity-encode it, I have:
Johnson &amp; Johnson
Now, if I wrap it in CDATA, I have:
<![CDATA[Johnson &amp; Johnson]]>
A well-behaved XML parser will decode this string as "Johnson &amp;
Johnson," which is not what we want. The decoded string should be "Johnson
& Johnson."
Here's the W3C's spec on CDATA:
http://www.w3.org/TR/REC-xml/#sec-cdata-sect
As the W3C says, ampersands inside CDATA ''are treated literally.'' This
is why <![CDATA[Johnson &amp; Johnson]]> is decoded as "Johnson &amp;
Johnson."
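To make the difference concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The <title> wrapper and the helper name parsed_title are assumptions for illustration only; this is not the code WordPress uses to build feeds.

```python
import xml.etree.ElementTree as ET


def parsed_title(xml_fragment):
    """Return the text a conforming XML parser sees inside the element."""
    return ET.fromstring(xml_fragment).text


# Escaped once, using an entity reference.
entity_only = "<title>Johnson &amp; Johnson</title>"
# Escaped once, using a CDATA section.
cdata_only = "<title><![CDATA[Johnson & Johnson]]></title>"
# Escaped twice: an entity reference inside a CDATA section.
both = "<title><![CDATA[Johnson &amp; Johnson]]></title>"

for label, fragment in [("entity", entity_only),
                        ("cdata", cdata_only),
                        ("both", both)]:
    print(label, "->", parsed_title(fragment))
```

Either method on its own decodes to the intended string. Only the combination leaves the entity reference visible in the decoded text.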
Even if we can't name a specific client that chokes on the twice-escaped
XML WordPress produces, it is a very bad idea to spit out incorrect XML.
The practical reason--and this is the practical reason for all W3C
standards--is that you want your output to be readable by ''any''
standards-compliant client. The fact that we can't name a client that
requires proper XML doesn't mean one doesn't exist. Nor should we expect
developers of future clients to pander to the incorrect XML we produce. If
we continue double-escaping our XML, we run the risk of creating something
analogous to quirks mode in web browsers: clients will have to figure out
on a case-by-case basis whether a given feed uses proper XML, or the
quirky, double-escaped WordPress style. They'll have to say, "Is this a
WordPress feed? If so, I should take into account that the markup is
improperly escaped. But if not, I should follow the W3C standard."
The fact that the markup validates does not prove it is correct. To the
contrary, the XML's encoding is improper according to the W3C standard. So
why doesn't the validator complain? Because it doesn't know the markup is
double-escaped. It thinks you ''intend'' for the HTML entities to be
literals, rather than markup. For all the validator knows, you could be
writing a how-to on HTML entities, which would properly include the
entities inside CDATA. For example, this would be perfectly valid:
<![CDATA[The XML entity for the ampersand is &amp;.]]>
In the above example, the author's intent was for &amp; to be treated as a
literal string, rather than be replaced with "&" after the XML is parsed.
So the code is correct. But the W3C validator can't guess your intent, so
it can't complain about the following:
<![CDATA[Johnson &amp; Johnson]]>
From the validator's perspective, it's quite possible that you wanted
&amp; to appear as a literal after parsing, when in fact you were most
likely trying to write "Johnson & Johnson." But the validator doesn't
(and, practically speaking, can't) know what you intended, so it has to
give you the benefit of the doubt when you include entities inside CDATA.
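The validator's predicament can be sketched in a few lines of Python. The regular expression and the function name leftover_entities below are hypothetical, made up for this illustration. They stand in for any automated check that looks for entity references surviving in the decoded text.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heuristic: matches anything shaped like a named or numeric
# entity reference that is still present after parsing.
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);")


def leftover_entities(xml_fragment):
    """Parse the fragment and list entity-like strings in the decoded text."""
    text = ET.fromstring(xml_fragment).text or ""
    return ENTITY_RE.findall(text)


# Intentional: the author really does want the literal entity to appear.
how_to = "<title><![CDATA[The XML entity for the ampersand is &amp;.]]></title>"
# Unintentional: the title was escaped twice.
double_escaped = "<title><![CDATA[Johnson &amp; Johnson]]></title>"

# Both fragments are well-formed, and both yield the same finding.
print(leftover_entities(how_to))
print(leftover_entities(double_escaped))
```

Both fragments parse cleanly and produce the same match, so a check like this cannot separate the legitimate how-to from the double-escaped title. Telling them apart requires knowing what the author meant.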
--
Ticket URL: <http://core.trac.wordpress.org/ticket/9992#comment:16>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software