[wp-trac] [WordPress Trac] #63611: wp_widget_rss_output: HTML entities that are part of HTML tags should be removed

Thu Dec 11 04:18:01 UTC 2025

#63611: wp_widget_rss_output: HTML entities that are part of HTML tags should be
removed
-------------------------------------------------+-------------------------
 Reporter:  wildworks                            |       Owner:  wildworks
     Type:  defect (bug)                         |      Status:  assigned
 Priority:  normal                               |   Milestone:  7.0
Component:  Widgets                              |     Version:
 Severity:  normal                               |  Resolution:
 Keywords:  good-first-bug has-test-info has-    |     Focuses:
  patch commit has-unit-tests                    |
-------------------------------------------------+-------------------------

Comment (by dmsnell):

 Today I ran some analysis on a set of around 30,000 RSS feeds I found,
 which were sourced from ingesting a Bluesky feed. Some insights follow.
 For context, we currently rely on `SimplePie` for parsing RSS feeds, and
 it seems to be built around the various RSS and Atom specifications.
 Unfortunately, producers of RSS/Atom feeds frequently implement those
 specifications in diverse ways.

 There is potential to switch to a content-based approach where WordPress
 infers the content type based on what it sees. For example, consider
 content-carrying elements like `TITLE`, `DESCRIPTION`, `CONTENT`, and
 `CONTENT:ENCODED` (unfortunately there’s no universal agreement on what
 //encoded// means here, as it could be HTML or XML).

 {{{#!php
 <?php
 // Some malformed HTML contains things which look like CDATA sections and
 // aren’t, but usually in an RSS feed if one is present, it’s XML. Common
 // RSS feeds also contain elements consisting only of a single CDATA
 // section, which could also be checked for. These CDATA sections are
 // purely for packaging the content, not for indicating what type of
 // content they are; so unpack it and try again.
 if ( contains_cdata_section( $content ) ) {
         return 'xml-decode-data-then-reassess';
 }

 // Assuming there are no CDATA sections, there could still be raw tags,
 // but these raw tags might be XHTML embedded within the XML of the feed,
 // or HTML found inside the feed. A giveaway of directly-embedded XHTML is
 // the presence of namespace directives. These should not contain encoded
 // HTML because they are the content themselves.
 if ( contains_tag( $content ) ) {
         if ( contains_xmlns_attribute( $content ) ) {
                 return 'parse-as-xhtml';
         } else {
                 return 'parse-as-html';
         }
 }

 // With no tags and no character references it’s all plaintext; it’s all
 // the same.
 if ( ! contains_character_reference( $content ) ) {
         return 'plaintext-nothing-to-do';
 }

 // XML 1.0 only defines &gt; &lt; &amp; &quot; and &apos; so if other
 // named character references are present it should be decoded as an
 // HTML text node.
 if ( contains_named_character_reference_other_than_big_5( $content ) ) {
         return 'parse-as-html';
 }

 // At this point the content could be HTML or HTML encoded inside XML. The
 // only character references are the syntax characters and numeric
 // character references, which do not give away the nature of the content.
 // The guessing comes from detecting the pattern of &lt;div&gt; as these
 // are unlikely to occur in normal text. Unfortunately, this leads to
 // mis-detection if someone is writing _about_ HTML tags and literally
 // encoded the syntax to preserve it. There should be a heuristic here to
 // make a choice in the presence of ambiguity, but it’s likely best to
 // assume that encodings of tags are actually tags.
 $decoded = WP_HTML_Decoder::decode_text_node( $content );
 if ( contains_tag( $decoded ) ) {
         return 'decode-then-parse-as-html';
 }

 return 'decode-then-plaintext';
 }}}
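
 The helper names in the sketch above are placeholders. As a rough
 illustration only, they might be approximated with regular expressions
 like the following. These bodies are assumptions, not functions that exist
 in WordPress or `SimplePie`, and a real implementation would want an
 actual XML/HTML tokenizer instead of pattern matching.

 {{{#!php
 <?php
 // Hypothetical helpers: regex approximations, not a real parser.

 function contains_cdata_section( string $content ): bool {
         return 1 === preg_match( '/<!\[CDATA\[.*?\]\]>/s', $content );
 }

 function contains_tag( string $content ): bool {
         // "<" or "</" followed by an ASCII letter, up to the next ">".
         return 1 === preg_match( '~</?[a-zA-Z][^>]*>~', $content );
 }

 function contains_xmlns_attribute( string $content ): bool {
         // An xmlns or xmlns:prefix attribute inside an opening tag.
         return 1 === preg_match( '~<[a-zA-Z][^>]*\sxmlns(?::[\w.-]+)?\s*=~', $content );
 }

 function contains_character_reference( string $content ): bool {
         // Decimal, hexadecimal, or named references ending in a semicolon.
         return 1 === preg_match( '/&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/', $content );
 }

 function contains_named_character_reference_other_than_big_5( string $content ): bool {
         if ( ! preg_match_all( '/&([a-zA-Z][a-zA-Z0-9]*);/', $content, $matches ) ) {
                 return false;
         }

         // The five names predefined by XML 1.0.
         $xml_names = array( 'gt', 'lt', 'amp', 'quot', 'apos' );

         return count( array_diff( $matches[1], $xml_names ) ) > 0;
 }
 }}}

 Note that `contains_tag()` as written here also matches tags inside a
 CDATA section, which is why the CDATA check has to run first.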

 Unfortunately, tags like `<content:encoded>` suggest that we have some
 underlying HTML or XHTML content inside them, but that indicator doesn’t
 tell us which, and its absence doesn’t imply there //isn’t// underlying
 HTML or XHTML.

 We might look to best practices such as in a feed like
 `https://nijigen-daily.com/atom.xml` which provides tags like this…

 {{{
 <content type="text/html" mode="escaped" xml:lang="ja" xml:base="https://nijigen-daily.com/archives/12944226.html">
 <![CDATA[<a  target="_blank"
 href="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg"><img
 src="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg"
 class="res-img" alt="【Key】無自覚にイケメン4人侍らせてるやつ|にじげん！デ
 イリー"></a><div  class="res-thread">
 </div>
 <div  class="res-thread"><div  class="res-block"><div  class="res-
 head"><span  class="res-name">1: 名無しさん</span><span  class="res-
 datetime">25/12/08(月)20:42</span><span  class="res-
 likes"></span></div><div  class="res-text">女子からしたら結構羨ましい立ち
 位置なんだろうか</div></div>
 <div  class="res-replies"><div  class="res-block res-reply"><div  class
 ="res-head"><span  class="res-name">21: 名無しさん</span><span  class
 ="res-datetime">25/12/08(月)20:52</span><span  class="res-likes">そうだね
 x12</span></div><div  class="res-text pink"><span  class="res-
 anchor">>>1</span><br />女子から嫌がらせされるくらいガチで嫌われてた
 はず</div></div>
 </div></div>
 <div  class="res-thread"><div  class="res-block"><div  class="res-
 head"><span  class="res-name">2: 名無しさん</span><span  class="res-
 datetime">25/12/08(月)20:43</span><span  class="res-likes">そうだね
 x5</span></div><div  class="res-text purple">小毬ちゃん入るまで女友達0だっ
 たからな…</div></div>
 </div>
 <div  class="res-thread"><div  class="res-block"><div  class="res-
 head"><span  class="res-name">3: 名無しさん</span><span  class="res-
 datetime">25/12/08(月)20:43</span><span  class="res-likes">そうだね
 x25</span></div><div  class="res-text red">本当に侍らせるのは理樹
 </div></div>
 </div>
 <link  href="https://nijigen-daily.com/nijigen_daily.css"
 rel="stylesheet">
 <a href="https://nijigen-daily.com/archives/12944238.html">続きを読む
 </a>]]>
 </content>
 }}}

 And we can say, “yes, thankfully someone indicates the encoding in the
 attributes” because indeed, the content //is// HTML serialized inside XML,
 not as XHTML but as an opaque text value of the element. However, earlier
 in the same document we find this…

 {{{
 <summary type="text/plain">
 &amp;gt;一体誰なのだ…だ…だれがいうかーーーーっ！！　風のようすがへんなのだ
 　雲じゃねーか！　新一という秘孔を突いた　ユ… ユ…！！　ゆうかーーーっ！！
 　ぬぅ！志村けんのカキタレ…！！　｜北斗の拳｜ジャンプ｜漫画・アニメ・ゲー
 ム記事のまとめサイトならにじげん！デイリー
 </summary>
 }}}

 So while this //positively identifies// the content as plaintext, we find
 after properly decoding the XML text node that we //start// with
 `&gt;一体誰なのだ…` which almost //certainly// should start `>一体誰なのだ…`,
 meaning the `type` should be `type="text/html" mode="escaped"`.
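
 To make the two layers concrete, here is a minimal sketch using
 `DOMDocument`. It assumes the feed's raw bytes carry the doubly-escaped
 form `&amp;gt;`, and the summary text is shortened; the variable names are
 only for illustration.

 {{{#!php
 <?php
 // A reduced stand-in for the feed's SUMMARY element (assumed raw form).
 $xml = '<summary type="text/plain">&amp;gt;一体誰なのだ…</summary>';

 $dom = new DOMDocument();
 $dom->loadXML( $xml );

 // Decoding the XML text node removes one layer of escaping.
 $text = $dom->documentElement->textContent;

 // Treating that result as an HTML text node removes the second layer.
 $html_decoded = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
 }}}

 Under that assumption `$text` still begins with a character reference,
 and only the second decode yields the leading `>`.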

 ----

 Another example is `https://www.gadgetguy.com.au/feed/`. Here we find
 `<description>` with no attributes, and it contains an encoded form of
 what would parse equally as HTML or XHTML. Later in the same feed we find
 `<content:encoded>` containing the same thing, which would seem to imply
 that `CONTENT` is encoded but `DESCRIPTION` isn’t, but that’s not true:
 both are encoded. This feed comes from WordPress.

 ----

 For `https://www.tagesschau.de/index~atom.xml` we find this interesting
 oddity:

 {{{
 <summary type="text/html" mode="escaped">Die USA haben…</summary>
     <content mode="escaped">&lt;![CDATA[&lt;p&gt; &lt;a
 href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-
 tanker-100.html"&gt;&lt;img
 src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920"
 alt="Donald Trump | AFP" /&gt;&lt;/a&gt; &lt;br/&gt; &lt;br/&gt;Die USA
 haben…[&lt;a href="https://www.tagesschau.de/ausland/amerika/trump-
 venezuela-tanker-100.html"&gt;mehr&lt;/a&gt;]&lt;/p&gt;]]&gt;</content>
 }}}

 Here, both `SUMMARY` and `CONTENT` are `mode="escaped"`, but `SUMMARY` is
 implied to be different, as it alone carries `type="text/html"`.
 Ironically, it contains //only// plaintext and lacks even a single
 character reference. Meanwhile, the `CONTENT` actually has XML
 //double-encoded// as XML, which then encodes HTML. Handling this requires
 some level of recursion, unless the extra decoding step is hard-coded.

 {{{#!php
 <?php
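 // get_content_element() and parse_xml() stand in for whatever feed-parsing
 // helpers are available.
 $content = get_content_element()->textContent;
 // Undo the extra layer of escaping to recover the CDATA-wrapped markup.
 $first_decode = html_entity_decode( $content, ENT_XML1 | ENT_SUBSTITUTE,
 'UTF-8' );
 // Parse what remains as XML; its text content is the HTML payload.
 $html = parse_xml( $first_decode )->textContent;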
 }}}

 For `<content mode="escaped">`, most feeds seem to produce this instead…

 {{{
 <![CDATA[<p> <a href="https://www.tagesschau.de/ausland/amerika/trump-
 venezuela-tanker-100.html"><img
 src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920"
 alt="Donald Trump | AFP" /></a> <br/> <br/>Die USA haben vor der Küste
 Venezuelas einen Tanker unter ihre Kontrolle gebracht. Das bestätigte US-
 Präsident Trump. Seit Wochen erhöhen die USA den Druck auf Venezuela und
 verlegen Seestreitkräfte in die Region.[<a
 href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-
 tanker-100.html">mehr</a>]</p>]]>
 }}}

 Note how someone took the intended serialized XML and then ran it through
 something like `htmlspecialchars()`, hiding it, much like what happened in
 the case that motivated this ticket.
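
 The “some level of recursion” could be as simple as peeling layers until
 nothing changes. The following is only a sketch: the function name, the
 depth cap, and the “starts with an escaped `<`” test are all assumptions
 of mine, not behavior of `SimplePie` or WordPress.

 {{{#!php
 <?php
 /**
  * Hypothetical helper: peel packaging layers until the text stops changing.
  * The depth cap is an arbitrary guard, not a value taken from any feed.
  */
 function unwrap_escaped_layers( string $text, int $max_depth = 3 ): string {
         for ( $i = 0; $i < $max_depth; $i++ ) {
                 $trimmed = trim( $text );

                 if ( 1 === preg_match( '/^<!\[CDATA\[(.*)\]\]>$/s', $trimmed, $m ) ) {
                         // A lone CDATA section is packaging only; keep its contents.
                         $next = $m[1];
                 } elseif ( 0 === strpos( $trimmed, '&lt;' ) ) {
                         // A leading escaped "<" suggests another layer of XML escaping.
                         $next = html_entity_decode( $trimmed, ENT_QUOTES | ENT_XML1, 'UTF-8' );
                 } else {
                         break;
                 }

                 if ( $next === $text ) {
                         break;
                 }

                 $text = $next;
         }

         return $text;
 }
 }}}

 This has the same weakness mentioned earlier: text that is genuinely
 //about// markup and starts with a literal `&lt;` would be unwrapped too.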

 ----

 == Get on with it!

  - `type="text/plain"` //might// indicate that we should avoid decoding
 //after deserializing from XML//.
  - `mode="escaped"` doesn’t communicate anything, because //all// HTML
 seems to be escaped. If the mode is missing, the content can only be
 plaintext or embedded XHTML; if there are tag-like things, though, it’s
 almost certainly XHTML. If, on the other hand, the mode is missing and
 there are things which look like tags //after// unescaping, it’s probably
 escaped anyway.
  - This is the kind of thing that probably //has// to rely on some
 heuristics based on the content in the item itself. Feeds sometimes
 aggregate items, and encoding models may diverge within the same XML
 document.
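
 As a minimal sketch of these points, a per-item guess could treat the
 declared `type` as a weak hint and let the decoded content decide. The
 function name and the `keep-as-plaintext` label are hypothetical, the
 XHTML branch is left out for brevity, and `html_entity_decode()` stands in
 for a proper HTML text decoder.

 {{{#!php
 <?php
 /**
  * Hypothetical per-item guess. $declared_type is the item's `type`
  * attribute if present; $xml_decoded_text is the element's text content
  * after normal XML decoding.
  */
 function guess_item_handling( ?string $declared_type, string $xml_decoded_text ): string {
         $looks_like_tag = static function ( string $s ): bool {
                 return 1 === preg_match( '~</?[a-zA-Z][^>]*>~', $s );
         };

         // Tags visible right after XML decoding: escaped HTML, whatever the mode says.
         if ( $looks_like_tag( $xml_decoded_text ) ) {
                 return 'parse-as-html';
         }

         $decoded_again = html_entity_decode( $xml_decoded_text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

         // Nothing left to decode, so every interpretation agrees.
         if ( $decoded_again === $xml_decoded_text ) {
                 return 'plaintext-nothing-to-do';
         }

         // Tags appear only after unescaping: probably escaped anyway.
         if ( $looks_like_tag( $decoded_again ) ) {
                 return 'decode-then-parse-as-html';
         }

         // Only character references remain; this is where the type hint can matter.
         return 'text/plain' === $declared_type ? 'keep-as-plaintext' : 'decode-then-plaintext';
 }
 }}}

 Because it runs per item, aggregated feeds whose items diverge in encoding
 would each get their own answer. Whether honoring `text/plain` in the last
 step is right remains an open question, given the mislabeled summary
 above.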

 I hope to automate the scanning of all of the RSS feeds I downloaded,
 including categorizing these into RSS vs. Atom explorations, but that will
 take more time than I had today. Needless to say, I think the current
 approach (parsing based on our inference of the specifications) is failing
 us. `SimplePie` is supposed to already decode and “sanitize” content, and
 that causes confusion in the diverse world of feeds.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63611#comment:34>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform