[wp-trac] [WordPress Trac] #9992: Atom feed titles are CDATA'ed and XML encoded.

Mon Mar 8 19:57:40 UTC 2010

#9992: Atom feed titles are CDATA'ed and XML encoded.
------------------------------------------------+---------------------------
 Reporter:  pm24601                             |        Owner:  josephscott
     Type:  defect (bug)                        |       Status:  reopened   
 Priority:  normal                              |    Milestone:  3.0        
Component:  Feeds                               |      Version:  2.7.1      
 Severity:  normal                              |   Resolution:             
 Keywords:  has-patch tested reporter-feedback  |  
------------------------------------------------+---------------------------
Changes (by jarrettc):

 * cc: jarrettc (added)

Comment:

 The XML is wrong, and this ''is'' a real problem. I'll explain the reasons
 for both assertions.

 The problem is that the XML is escaped twice. Entity-encoding is one
 method for escaping XML's control characters. CDATA is another method.
 Either one can be used. But if you use them both, you're escaping twice.
 Take the following string as an example:

   Johnson & Johnson

 If I entity-encode it, I have:

   Johnson &amp; Johnson

 Now, if I wrap it in CDATA, I have:

   <![CDATA[Johnson &amp; Johnson]]>

 A well-behaved XML parser will decode this string as "Johnson &amp;
 Johnson", which is not what we want. The decoded string should be "Johnson
 & Johnson".
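
 The three cases above can be checked with any conforming parser. Here is
 a minimal sketch using Python's standard xml.etree.ElementTree; the
 <title> wrapper and the helper name title_text are only illustrative, not
 taken from WordPress's feed code.

```python
import xml.etree.ElementTree as ET


def title_text(xml_fragment: str) -> str:
    """Parse a one-element XML fragment and return its decoded text."""
    return ET.fromstring(xml_fragment).text


# Escaped once, with an entity.
entity_only = title_text("<title>Johnson &amp; Johnson</title>")
# Escaped once, with CDATA.
cdata_only = title_text("<title><![CDATA[Johnson & Johnson]]></title>")
# Escaped twice: entity-encoded, then wrapped in CDATA.
both = title_text("<title><![CDATA[Johnson &amp; Johnson]]></title>")

print(entity_only)  # Johnson & Johnson
print(cdata_only)   # Johnson & Johnson
print(both)         # Johnson &amp; Johnson
```

 Only the twice-escaped form comes back with a literal "&amp;" in it.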

 Here's the W3C's spec on CDATA:

 http://www.w3.org/TR/REC-xml/#sec-cdata-sect

 As the W3C says, ampersands inside CDATA ''are treated literally.'' This
 is why <![CDATA[Johnson &amp; Johnson]]> is decoded as "Johnson &amp;
 Johnson".
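
 On the producing side, the fix is to pick one escaping method and apply
 it exactly once. The following is a rough sketch in Python, not the
 actual WordPress patch: as_entities, as_cdata and decoded are
 hypothetical helper names, and the <title> wrapper is only there so the
 result can be parsed back.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_entities(text: str) -> str:
    """Entity-encode &, < and >; the result must NOT be wrapped in CDATA."""
    return escape(text)


def as_cdata(text: str) -> str:
    """Wrap raw text in CDATA; the text must NOT be entity-encoded first."""
    # A literal "]]>" would close the section early, so split it across
    # two adjacent CDATA sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def decoded(payload: str) -> str:
    """Return what a parser sees when payload is placed inside <title>."""
    return ET.fromstring("<title>" + payload + "</title>").text


raw = "Johnson & Johnson"

# Either method alone round-trips to the original string.
assert decoded(as_entities(raw)) == raw
assert decoded(as_cdata(raw)) == raw

# Applying both does not: the entity survives as literal text.
assert decoded(as_cdata(as_entities(raw))) == "Johnson &amp; Johnson"
```

 The round-trip check (escape, parse, compare with the original) is a
 cheap way to catch double escaping that a validator will not flag.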

 Even if we can't name a specific client that chokes on the twice-escaped
 XML WordPress produces, it is a very bad idea to spit out incorrect XML.
 The practical reason--and this is the practical reason for all W3C
 standards--is that you want your output to be readable by ''any''
 standards-compliant client. The fact that we can't name a client that
 requires proper XML doesn't mean one doesn't exist. Nor should we expect
 developers of future clients to pander to the incorrect XML we produce. If
 we continue double-escaping our XML, we run the risk of creating something
 analogous to quirks mode in web browsers: clients will have to figure out
 on a case-by-case basis whether a given feed uses proper XML, or the
 quirky, double-escaped WordPress style. They'll have to say, "Is this a
 WordPress feed? If so, I should take into account that the markup is
 improperly escaped. But if not, I should follow the W3C standard."

 The fact that the markup validates does not prove it is correct. On the
 contrary, the XML's encoding is improper according to the W3C standard. So
 why doesn't the validator complain? Because it doesn't know the markup is
 double-escaped. It thinks you ''intend'' for the HTML entities to be
 literals, rather than markup. For all the validator knows, you could be
 writing a how-to on HTML entities, which would properly include the
 entities inside CDATA. For example, this would be perfectly valid:

   <![CDATA[The XML entity for the ampersand is &amp;.]]>

 In the above example, the author's intent was for &amp; to be treated as a
 literal string, rather than be replaced with "&" after the XML is parsed.
 So the code is correct. But the W3C validator can't guess your intent, so
 it can't complain about the following:

   <![CDATA[Johnson &amp; Johnson]]>

 From the validator's perspective, it's quite possible that you wanted
 &amp; to appear as a literal after parsing, when in fact you were most
 likely trying to write "Johnson & Johnson". But the validator doesn't
 (and, practically speaking, can't) know what you intended, so it has to
 give you the benefit of the doubt when you include entities inside CDATA.

-- 
Ticket URL: <http://core.trac.wordpress.org/ticket/9992#comment:16>
WordPress Trac <http://core.trac.wordpress.org/>
WordPress blogging software