[wp-trac] [WordPress Trac] #59883: Remove support for HTML4 and XHTML
WordPress Trac
noreply at wordpress.org
Wed Apr 3 16:29:02 UTC 2024
#59883: Remove support for HTML4 and XHTML
-------------------------------------------------+-------------------------
Reporter: dmsnell                                | Owner: (none)
Type: enhancement                                | Status: new
Priority: normal                                 | Milestone: Awaiting Review
Component: HTML API                              | Version: 6.5
Severity: normal                                 | Resolution:
Keywords: dev-feedback has-dev-note 2nd-opinion  | Focuses:
-------------------------------------------------+-------------------------
Description changed by dmsnell:
Old description:
> == Summary
>
> WordPress still officially supports HTML4 and XHTML, but the browsers it
> serves and the broader web effectively don't. Let's remove support so
> that we can modernize the code we write and simplify Core's HTML-handling
> functionality.
>
> == Background
>
> This came up recently in #58664 and in an exploration
> [https://github.com/WordPress/wordpress-develop/pull/5337 rewriting
> esc_attr()].
>
> In various places WordPress maintains the appearance of supporting HTML4,
> for example:
>
> - `wp_kses_named_entities()` rejects valid HTML5 named character
> references, such as those that decode to `⇵` (U+21F5), and in turn
> corrupts documents containing them.
> - script and style tags conditionally add `type` attributes that never
> need to be printed
> - widgets selectively render `<nav>` and strip tags out of the `$title`
> for a page when TITLE elements can contain no tags anyway. This leads to
> corruption in the page title for removing what WordPress thinks are tags
> but aren't.
> - various places run `kses` as if serving XHTML, adding needless syntax
> such as the self-closing flag on void elements, which HTML5 parsers
> ignore, e.g. `<img />`, `<br />`, `<meta />`
>
> The //appearance// of serving HTML4 or XHTML stems from the fact that
> it's very rare to serve actual XHTML content, and perhaps impossible to
> serve HTML4 content, to any supported browser or environment.
>
> - browsers ignore any `<?xml ?>` or `<!DOCTYPE>` declaration specifying
> HTML4 or XHTML. They interpret a page as HTML5 regardless. You can
> confirm this by visiting a page with the `&lang;` named character
> reference. If interpreted as HTML4 it will transform into the U+2329
> code point `〈`, but if interpreted as HTML5 it will transform into the
> U+27E8 code point `⟨`.
> - the only way to serve a page as XHTML is to send the HTTP header
> `Content-type: application/xhtml+xml` or to serve the page with the
> `.xml` file extension in the URL (e.g. serve `index.xml` instead of
> `index.html` or `index.php` or `/index` or `/`). It's not enough to send
> a `<meta http-equiv="content-type" content="application/xhtml+xml">` tag;
> it //must// come through the HTTP headers.
>
> Because of this behavior in browsers, WordPress sends content that it
> thinks is one thing but is received as another. Removing official support
> means that we can start to remove those places that purport to send HTML4
> or XHTML content when that assumption is wrong and can lead to data
> corruption, let alone needless syntax noise.
>
> WordPress still serves XML content in RSS feeds; this proposal does not
> recommend removing support for generating the XML feeds, but it may
> extend to the escaping and rendering of embedded HTML within those feeds,
> since an RSS reader is unlikely to and should not be interpreting
> embedded HTML as HTML4 and should be supporting embedded HTML5 as any web
> browser would. As an embedding, the content rendered into the feed
> remains separate from the surrounding RSS XML container.
>
> == Action plan
>
> Removing support for HTML4 and XHTML doesn't require any immediate action
> because HTML5 parsers compliantly parse HTML4 and XHTML up to their
> conflicting rules, such as with the `&lang;` named character reference.
> Since WordPress is already "broken" in this sense today, removing support
> does not imply that these are new bugs; rather it acknowledges that we
> missed updating WordPress once HTML4 and XHTML properly disappeared.
>
> In future work it opens up opportunities to modernize WordPress:
> - we don't need to handle complicated corner cases where pre-HTML5
> renderers require special cases.
> - we can remove code meant for backwards compatibility which no longer
> provides that support.
> - we can update Core functions such as `wp_kses_named_entities()` to
> prevent them from corrupting data based on inaccurate parsing rules from
> the past.
> - we can define a body of support and scope for what WordPress will and
> won't attempt to clean up. Functions like `force_balance_tags()` and
> encoding functions attempt to normalize and sanitize HTML but just as
> often further break that HTML when passing it through to the browser
> would have a deterministic and safe resolution.
> - we can eliminate wrapping script output with CDATA escaping which is
> only needed for XML compatibility.
> - we can use HTML5 form validation by default in more places instead of
> requiring an opt-in.
>
> The HTML API is providing WordPress the ability to have a smarter Core
> HTML system that won't be confused by rare or unexpected inputs and leans
> heavily on a spec-compliant "garbage-in garbage-out" approach. This
> dramatically simplifies HTML processing code without opening unsafe
> avenues; this is because HTML5 defines how to handle abnormal inputs.
>
> Weston [https://github.com/GoogleChromeLabs/wpp-research/pull/74 queried
> the HTTP Archive] and found up to potentially two sites among millions
> that are serving XHTML content through the inclusion of proper HTTP
> headers.
>
> == Linked Issues
>
> - #60320 the `CDATA` wrappers around inline JavaScript break
> non-JavaScript `SCRIPT` contents.
New description:

== Summary

WordPress still officially supports HTML4 and XHTML, but the browsers it
serves and the broader web effectively don't. Let's remove support so that
we can modernize the code we write and simplify Core's HTML-handling
functionality.

== Background

This came up recently in #58664 and in an exploration
[https://github.com/WordPress/wordpress-develop/pull/5337 rewriting
esc_attr()].

In various places WordPress maintains the appearance of supporting HTML4,
for example:

- `wp_kses_named_entities()` rejects valid HTML5 named character
references, such as those that decode to `⇵` (U+21F5), and in turn
corrupts documents containing them.
- script and style tags conditionally add `type` attributes that never
need to be printed
- widgets selectively render `<nav>`, and tags are stripped out of the
`$title` for a page even though TITLE elements can contain no tags anyway.
This corrupts the page title by removing text that WordPress thinks is a
tag but isn't.
- various places run `kses` as if serving XHTML, adding needless syntax
such as the self-closing flag on void elements, which HTML5 parsers
ignore, e.g. `<img />`, `<br />`, `<meta />`
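
To illustrate the first point above, here is a minimal Python sketch. It is
not WordPress code: `normalize_entities` and `NAMED_REFERENCE` are
hypothetical names, and Python's `html.entities.name2codepoint` (HTML 4.01
names) and `html.entities.html5` tables are assumed stand-ins for the
allowlist that `wp_kses_named_entities()` consults.

```python
import html
import re
from html.entities import html5, name2codepoint

# Matches text that looks like a named character reference, e.g. "&hellip;".
NAMED_REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def normalize_entities(text, use_html5_names=False):
    """Escape the "&" of any named reference the chosen table doesn't know.

    Hypothetical stand-in for an allowlist-based entity filter; it is not
    the WordPress implementation.
    """
    def replace(match):
        name = match.group(1)
        if use_html5_names:
            known = name + ";" in html5      # HTML5 table keys end with ";"
        else:
            known = name in name2codepoint   # HTML 4.01 names only
        return match.group(0) if known else "&amp;" + name + ";"

    return NAMED_REFERENCE.sub(replace, text)


source = "up &uarr; down &darr; both &DownArrowUpArrow;"

html4_filtered = normalize_entities(source)
html5_filtered = normalize_entities(source, use_html5_names=True)

# Decode each filtered string with HTML5 rules, as a browser would.
print(html.unescape(html4_filtered))
print(html.unescape(html5_filtered))
```

With the HTML4-only table the last reference is escaped, so a reader sees
the literal text `&DownArrowUpArrow;` instead of the arrow the author
wrote; with the HTML5 table the input passes through unchanged.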

The //appearance// of serving HTML4 or XHTML stems from the fact that it's
very rare to serve actual XHTML content, and perhaps impossible to serve
HTML4 content, to any supported browser or environment.

- browsers ignore any `<?xml ?>` or `<!DOCTYPE>` declaration specifying
HTML4 or XHTML. They interpret a page as HTML5 regardless. You can confirm
this by visiting a page with the `&lang;` named character reference. If
interpreted as HTML4 it will transform into the U+2329 code point `〈`,
but if interpreted as HTML5 it will transform into the U+27E8 code point
`⟨`.
- the only way to serve a page as XHTML is to send the HTTP header
`Content-Type: application/xhtml+xml` or to serve the page with the `.xml`
file extension in the URL (e.g. serve `index.xml` instead of `index.html`
or `index.php` or `/index` or `/`). It's not enough to send a
`<meta http-equiv="content-type" content="application/xhtml+xml">` tag; it
//must// come through the HTTP headers.
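
The header rule in the second point can be sketched as a small decision
function. This is a simplified model of the behavior described above, not a
browser or WordPress API: `media_type` and `parser_for_response` are
hypothetical names, only the HTTP `Content-Type` header is considered (the
file-extension case is left out), and every other media type is treated as
HTML for brevity.

```python
def media_type(content_type_header):
    """Return the lowercase media type without parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def parser_for_response(content_type_header, document):
    """Pick the parser from the HTTP header alone.

    The document body is accepted but deliberately ignored: an XML
    declaration, an XHTML DOCTYPE, or a <meta http-equiv> tag inside the
    markup does not change the outcome in this model.
    """
    if media_type(content_type_header) == "application/xhtml+xml":
        return "xml"
    return "html5"


# A page that claims to be XHTML in every way except the HTTP header.
xhtml_looking_page = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
    '<meta http-equiv="content-type" content="application/xhtml+xml" />'
    "</head><body><br /></body></html>"
)

print(parser_for_response("text/html; charset=UTF-8", xhtml_looking_page))
print(parser_for_response("application/xhtml+xml; charset=UTF-8", xhtml_looking_page))
```

The same markup lands in the HTML5 parser under `text/html` and in the XML
parser only when the header says `application/xhtml+xml`.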

Because of this behavior in browsers, WordPress sends content that it
thinks is one thing but is received as another. Removing official support
means that we can start to remove those places that purport to send HTML4
or XHTML content when that assumption is wrong and can lead to data
corruption, not to mention needless syntax noise.

WordPress still serves XML content in RSS feeds. This proposal does not
recommend removing support for generating the XML feeds, but it may extend
to the escaping and rendering of embedded HTML within those feeds: an RSS
reader is unlikely to interpret embedded HTML as HTML4, and should not; it
should support embedded HTML5 as any web browser would. As an embedding,
the content rendered into the feed remains separate from the surrounding
RSS XML container.

== Action plan

Removing support for HTML4 and XHTML doesn't require any immediate action
because HTML5 parsers compliantly parse HTML4 and XHTML up to their
conflicting rules, such as with the `&lang;` named character reference.
Since WordPress is already "broken" in this sense today, removing support
does not imply that these are new bugs; rather it acknowledges that we
missed updating WordPress once HTML4 and XHTML properly disappeared.
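
As a rough way to see how small the conflicting area is for named
character references, the following Python sketch compares the HTML 4.01
names with their HTML5 definitions. It relies on Python's
`html.entities.name2codepoint` and `html.entities.html5` tables as
stand-ins for the two specifications, which is an assumption; `conflicts`
is just an illustrative variable name.

```python
from html.entities import html5, name2codepoint

# HTML 4.01 names whose HTML5 definition decodes to different text.
conflicts = {
    name: (chr(code_point), html5[name + ";"])
    for name, code_point in name2codepoint.items()
    if name + ";" in html5 and html5[name + ";"] != chr(code_point)
}


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(char):04X}" for char in text)


for name, (html4_text, html5_text) in sorted(conflicts.items()):
    print(f"&{name}; HTML4 {code_points(html4_text)} -> HTML5 {code_points(html5_text)}")
```

The `&lang;` reference mentioned above shows up in this list, mapping to
U+2329 under HTML4 and U+27E8 under HTML5.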

In future work it opens up opportunities to modernize WordPress:

- we don't need to handle complicated corner cases where pre-HTML5
renderers require special cases.
- we can remove code meant for backwards compatibility which no longer
provides that support.
- we can update Core functions such as `wp_kses_named_entities()` to
prevent them from corrupting data based on inaccurate parsing rules from
the past.
- we can define a body of support and scope for what WordPress will and
won't attempt to clean up. Functions like `force_balance_tags()` and
encoding functions attempt to normalize and sanitize HTML, but just as
often they further break that HTML when passing it through to the browser
unchanged would have a deterministic and safe resolution.
- we can eliminate wrapping script output with CDATA escaping which is
only needed for XML compatibility.
- we can use HTML5 form validation by default in more places instead of
requiring an opt-in.
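
The CDATA point is not only about noise. The Python sketch below models
what happens when an XML-compatibility wrapper is placed around `SCRIPT`
contents that are not JavaScript, here a JSON payload. The exact wrapper
text, the `wrap_in_cdata` and `is_valid_json` helpers, and the `postId`
payload are assumptions made for illustration, not WordPress code.

```python
import json


def wrap_in_cdata(script_body):
    """Approximate an XML-compatibility wrapper around inline script text.

    The wrapper is hidden inside JavaScript comments, which only helps
    when the contents really are JavaScript.
    """
    return "/* <![CDATA[ */\n" + script_body + "\n/* ]]> */"


def is_valid_json(text):
    """Report whether the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Contents of a hypothetical non-JavaScript SCRIPT element holding JSON.
payload = json.dumps({"postId": 42})
wrapped = wrap_in_cdata(payload)

print(is_valid_json(payload))
print(is_valid_json(wrapped))
```

JSON has no comment syntax, so the wrapped payload no longer parses even
though the bare payload does.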

The HTML API gives WordPress the ability to have a smarter Core HTML
system that won't be confused by rare or unexpected inputs and leans
heavily on a spec-compliant "garbage-in garbage-out" approach. This
dramatically simplifies HTML processing code without opening unsafe
avenues, because HTML5 defines how to handle abnormal inputs.

Weston [https://github.com/GoogleChromeLabs/wpp-research/pull/74 queried
the HTTP Archive] and found at most two sites, among millions, that are
serving XHTML content through the inclusion of proper HTTP headers.

== Linked Issues

- #60320 the `CDATA` wrappers around inline JavaScript break
non-JavaScript `SCRIPT` contents.
- [https://github.com/WordPress/wpcs-docs/pull/136 wpcs-docs#136] XHTML
and HTML conflicts
--
Ticket URL: <https://core.trac.wordpress.org/ticket/59883#comment:5>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform