[wp-trac] [WordPress Trac] #63611: wp_widget_rss_output: HTML entities that are part of HTML tags should be removed
WordPress Trac
noreply at wordpress.org
Thu Dec 11 04:18:01 UTC 2025
#63611: wp_widget_rss_output: HTML entities that are part of HTML tags should be
removed
-------------------------------------------------+-------------------------
Reporter: wildworks | Owner: wildworks
Type: defect (bug) | Status: assigned
Priority: normal | Milestone: 7.0
Component: Widgets | Version:
Severity: normal | Resolution:
Keywords: good-first-bug has-test-info has- | Focuses:
patch commit has-unit-tests |
-------------------------------------------------+-------------------------
Comment (by dmsnell):
Today I ran some analysis on a set of around 30,000 RSS feeds I found,
which were sourced by ingesting a Bluesky feed. Following are some
insights. For context, we currently rely on `SimplePie` for parsing the
RSS feeds, which seems to be built around the various RSS and Atom
specifications. Unfortunately, producers of RSS/Atom feeds frequently
implement those specifications in diverse ways.
There is potential to switch to a content-based approach where WordPress
infers content type based on what it sees. For example, let us consider
content-carrying elements like `TITLE`, `DESCRIPTION`, `CONTENT`, and
`CONTENT:ENCODED` (unfortunately there’s no universal agreement on what
//encoded// means here, as it could be HTML or XML).
{{{#!php
<?php
// Some malformed HTML contains things which look like CDATA sections and aren’t,
// but usually in an RSS feed if one is present, it’s XML. Common RSS feeds also
// contain elements comprising only a single CDATA section, which could also be
// checked for. These CDATA sections are purely for packaging the content, not
// for indicating what type of content they are; so unpack it and try again.
if ( contains_cdata_section( $content ) ) {
    return 'xml-decode-data-then-reassess';
}

// Assuming there are no CDATA sections, there could still be raw tags, but
// these raw tags might be XHTML embedded within the XML of the feed, or
// HTML found inside the feed. A giveaway of directly-embedded XHTML is
// the presence of namespace directives. These should not contain encoded
// HTML because they are the content themselves.
if ( contains_tag( $content ) ) {
    if ( contains_xmlns_attribute( $content ) ) {
        return 'parse-as-xhtml';
    } else {
        return 'parse-as-html';
    }
}

// With no tags and no character references it’s all plaintext, it’s all the same.
if ( ! contains_character_reference( $content ) ) {
    return 'plaintext-nothing-to-do';
}

// XML 1.0 only defines &gt; &lt; &amp; &quot; and &apos; so if other named
// character references are present it should be decoded as an HTML text node.
if ( contains_named_character_reference_other_than_big_5( $content ) ) {
    return 'parse-as-html';
}

// At this point the content could be HTML or HTML encoded inside XML. The only
// character references are the syntax characters and numeric character references,
// which do not give away the nature of the content. The guessing comes from
// detecting the pattern of &lt;div&gt; as these are unlikely to occur in normal
// text. Unfortunately, this leads to mis-detection if someone is writing _about_
// HTML tags and literally encoded the syntax to preserve it. There should be a
// heuristic here to make a choice in the presence of ambiguity, but it’s likely
// best to assume that encodings of tags are actually tags.
$decoded = WP_HTML_Decoder::decode_text_node( $content );
if ( contains_tag( $decoded ) ) {
    return 'decode-then-parse-as-html';
}

return 'decode-then-plaintext';
}}}
Unfortunately, tags like `<content:encoded>` suggest that we have some
underlying HTML or XHTML content inside them, but that indicator doesn’t
tell us which, and its absence doesn’t imply there //isn’t// underlying
HTML or XHTML.
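The predicates used in the decision sketch above are not defined anywhere in this comment. As a rough illustration only, they could be approximated with regular expressions as below. These bodies are assumptions, not existing WordPress functions, and a pattern match is much cruder than a real HTML or XML parser (for example, it ignores legacy HTML named references that omit the trailing semicolon).

```php
<?php
// Hypothetical helper bodies: regex approximations, not real parsers.

function contains_cdata_section( string $content ): bool {
	// An opening CDATA marker followed somewhere by a closing marker.
	return 1 === preg_match( '/<!\[CDATA\[.*?\]\]>/s', $content );
}

function contains_tag( string $content ): bool {
	// "<", an optional "/", an ASCII letter, then anything up to ">".
	return 1 === preg_match( '~</?[a-zA-Z][^>]*>~', $content );
}

function contains_xmlns_attribute( string $content ): bool {
	// xmlns="…" or xmlns:prefix="…" appearing inside an opening tag.
	return 1 === preg_match( '~<[a-zA-Z][^>]*\sxmlns(:[\w.-]+)?\s*=~', $content );
}

function contains_character_reference( string $content ): bool {
	// Named (&name;), decimal (&#60;), or hexadecimal (&#x3C;) references.
	return 1 === preg_match( '/&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);/', $content );
}

function contains_named_character_reference_other_than_big_5( string $content ): bool {
	// Any named reference except the five predefined by XML 1.0.
	return 1 === preg_match( '/&(?!(?:gt|lt|amp|quot|apos);)[a-zA-Z][a-zA-Z0-9]*;/', $content );
}
```

A stray `<` or `>` in ordinary prose does not count as a tag here, because the pattern requires a letter immediately after the opening bracket.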
We might look to best practices such as in a feed like
`https://nijigen-daily.com/atom.xml`, which provides tags like this…
{{{
<content type="text/html" mode="escaped" xml:lang="ja" xml:base="https://nijigen-daily.com/archives/12944226.html">
<![CDATA[<a target="_blank"
href="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg"><img
src="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg"
class="res-img" alt="【Key】無自覚にイケメン4人侍らせてるやつ|にじげん!デ
イリー"></a><div class="res-thread">
</div>
<div class="res-thread"><div class="res-block"><div class="res-
head"><span class="res-name">1: 名無しさん</span><span class="res-
datetime">25/12/08(月)20:42</span><span class="res-
likes"></span></div><div class="res-text">女子からしたら結構羨ましい立ち
位置なんだろうか</div></div>
<div class="res-replies"><div class="res-block res-reply"><div class
="res-head"><span class="res-name">21: 名無しさん</span><span class
="res-datetime">25/12/08(月)20:52</span><span class="res-likes">そうだね
x12</span></div><div class="res-text pink"><span class="res-
anchor">>>1</span><br />女子から嫌がらせされるくらいガチで嫌われてた
はず</div></div>
</div></div>
<div class="res-thread"><div class="res-block"><div class="res-
head"><span class="res-name">2: 名無しさん</span><span class="res-
datetime">25/12/08(月)20:43</span><span class="res-likes">そうだね
x5</span></div><div class="res-text purple">小毬ちゃん入るまで女友達0だっ
たからな…</div></div>
</div>
<div class="res-thread"><div class="res-block"><div class="res-
head"><span class="res-name">3: 名無しさん</span><span class="res-
datetime">25/12/08(月)20:43</span><span class="res-likes">そうだね
x25</span></div><div class="res-text red">本当に侍らせるのは理樹
</div></div>
</div>
<link href="https://nijigen-daily.com/nijigen_daily.css"
rel="stylesheet">
<a href="https://nijigen-daily.com/archives/12944238.html">続きを読む
</a>]]>
</content>
}}}
And we can say, “yes, thankfully someone indicates the encoding in the
attributes” because indeed, the content //is// HTML serialized inside XML,
not as XHTML but as an opaque text value of the element. However, earlier
in the same document we find this…
{{{
<summary type="text/plain">
>一体誰なのだ…だ…だれがいうかーーーーっ!! 風のようすがへんなのだ
雲じゃねーか! 新一という秘孔を突いた ユ… ユ…!! ゆうかーーーっ!!
ぬぅ!志村けんのカキタレ…!! |北斗の拳|ジャンプ|漫画・アニメ・ゲー
ム記事のまとめサイトならにじげん!デイリー
</summary>
}}}
So while this //positively identifies// the content as plaintext, we find
after properly decoding the XML text node that we //start// with
`&gt;一体誰なのだ…` which almost //certainly// should start `>一体誰なのだ…`,
meaning the `type` should be `type="text/html" mode="escaped"`.
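To make the mismatch concrete, here is a minimal sketch of an element declared as `text/plain` whose decoded text node still carries a character reference. The XML string is invented for illustration and is not copied from the feed; it only shows what "a reference survives XML decoding" looks like in practice.

```php
<?php
// Assumed sample: an element declared as plaintext whose text was escaped
// once more than the declared type implies.
$xml = '<summary type="text/plain">&amp;gt;quoted text</summary>';

// Casting the element to a string yields its XML-decoded text content.
$element   = new SimpleXMLElement( $xml );
$text_node = (string) $element;

// A character reference is still present, so the value behaves like
// escaped HTML; decoding it once more gives the text a reader would expect.
$as_html = html_entity_decode( $text_node, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
```

After the first step `$text_node` holds `&gt;quoted text`; only the second decode produces `>quoted text`.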
----
Another example is `https://www.gadgetguy.com.au/feed/`. Here we find
`<description>` with no attributes and it contains an encoded form of what
would parse equally as HTML or XHTML. Later in the same feed we find
`<content:encoded>` containing the same thing, which would seem to imply
that `CONTENT` is encoded but `DESCRIPTION` isn’t, but that’s not true.
This feed comes from a WordPress site.
----
For `https://www.tagesschau.de/index~atom.xml` we find this interesting
oddity:
{{{
<summary type="text/html" mode="escaped">Die USA haben…</summary>
<content mode="escaped"><![CDATA[<p> <a
href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html"><img
src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920"
alt="Donald Trump | AFP" /></a> <br/> <br/>Die USA
haben…[<a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html">mehr</a>]</p>]]></content>
}}}
Here, both `SUMMARY` and `CONTENT` are `mode="escaped"`, but `SUMMARY` is
implied to be different because it also carries `type="text/html"`.
Ironically, it contains //only// plaintext and lacks even a single
character reference. Meanwhile, the `CONTENT` actually has XML
//double-encoded// as XML, which then encodes HTML. Handling this requires
some level of recursion unless the extra decoding step is hard-coded.
{{{#!php
<?php
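// get_content_element() and parse_xml() are placeholders, not real functions.
$content = get_content_element()->textContent;
// Undo the extra layer of XML escaping to recover the serialized XML.
$first_decode = html_entity_decode( $content, ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );
// Parsing that XML finally exposes the HTML carried inside it.
$html = parse_xml( $first_decode )->textContent;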
}}}
Most feeds seem to produce this for `<content mode="escaped">` instead…
{{{
<![CDATA[<p> <a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html"><img
src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920"
alt="Donald Trump | AFP" /></a> <br/> <br/>Die USA haben vor der Küste
Venezuelas einen Tanker unter ihre Kontrolle gebracht. Das bestätigte
US-Präsident Trump. Seit Wochen erhöhen die USA den Druck auf Venezuela und
verlegen Seestreitkräfte in die Region.[<a
href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html">mehr</a>]</p>]]>
}}}
We can note how someone took the intended serialized XML and then ran it
through something like `htmlspecialchars()` to hide it, much like what
happened as the motivating case for this ticket.
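One way to avoid hard-coding the number of decoding passes is to peel off packaging layers in a bounded loop. The sketch below is only an illustration: `unwrap_layers()` and its `$max_layers` limit are hypothetical names, it uses plain `html_entity_decode()` rather than any WordPress decoder, and it assumes a value holds at most one CDATA section.

```php
<?php
/**
 * Repeatedly strips packaging layers (escaped markup, a CDATA wrapper)
 * until a raw tag appears or nothing changes. Hypothetical helper.
 */
function unwrap_layers( string $content, int $max_layers = 3 ): string {
	for ( $i = 0; $i < $max_layers; $i++ ) {
		// The value is exactly one CDATA section: unpack it and reassess.
		if ( 1 === preg_match( '/^\s*<!\[CDATA\[(.*?)\]\]>\s*$/s', $content, $m ) ) {
			$content = $m[1];
			continue;
		}

		// Raw tags are visible, so what remains is the markup itself.
		if ( 1 === preg_match( '~</?[a-zA-Z][^>]*>~', $content ) ) {
			break;
		}

		$decoded = html_entity_decode( $content, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
		if ( $decoded === $content ) {
			break; // No character references left to decode.
		}
		$content = $decoded;
	}

	return $content;
}
```

The same ambiguity noted earlier applies: text that is //about// HTML and deliberately escapes its tags would be unwrapped too, so the layer limit bounds the work but does not resolve the guess.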
----
== Get on with it!
- `type="text/plain"` //might// indicate that we should avoid decoding
//after deserializing from XML//.
- `mode="escaped"` doesn’t communicate anything, because //all// HTML
seems to be escaped. If the mode is missing, the content should only be
plaintext or embedded XHTML, and if there are tag-like things it’s almost
certainly XHTML. If, on the other hand, the mode is missing and there
are things which look like tags //after// unescaping, it’s probably
escaped anyway.
- This is the kind of thing that probably //has// to rely on some
heuristics based on the content in the item itself. Feeds sometimes
aggregate items, and encoding models may diverge within the same XML
document.
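As a small illustration of deciding per item rather than per feed, the sketch below inspects each `<description>` on its own. Both function names are hypothetical, the feed layout is assumed to be plain RSS 2.0 (`rss > channel > item`), and the check is a crude regex rather than a real parser.

```php
<?php
// Hypothetical per-item check: decode once, then look for anything tag-like.
function looks_like_markup( string $text ): bool {
	$decoded = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	return 1 === preg_match( '~</?[a-zA-Z][^>]*>~', $decoded );
}

// Classifies every item independently, since encodings may differ within a feed.
function classify_items( string $feed_xml ): array {
	$feed    = new SimpleXMLElement( $feed_xml );
	$results = array();

	foreach ( $feed->channel->item as $item ) {
		// The string cast yields the XML-decoded text of the element.
		$results[] = looks_like_markup( (string) $item->description ) ? 'markup' : 'text';
	}

	return $results;
}
```

Because no decision is cached at the feed level, one aggregated item with escaped HTML does not change how a neighbouring plaintext item is treated.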
I hope to automate the scanning of all of the RSS feeds I downloaded,
including categorizing them into RSS and Atom groups, but that will
take more time than I had today. Needless to say, I think the current
approach (parsing based on our inference of the specifications) is
failing us. `SimplePie` is supposed to already decode and “sanitize”
content, and that causes confusion in the diverse world of feeds.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63611#comment:34>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform