[wp-trac] [WordPress Trac] #62091: XML API: Produce XML Serialization of HTML (XHTML)

WordPress Trac noreply at wordpress.org
Tue Mar 31 01:32:23 UTC 2026


#62091: XML API: Produce XML Serialization of HTML (XHTML)
-----------------------------+-----------------------------
 Reporter:  dmsnell          |       Owner:  (none)
     Type:  feature request  |      Status:  new
 Priority:  normal           |   Milestone:  Future Release
Component:  HTML API         |     Version:  6.7
 Severity:  normal           |  Resolution:
 Keywords:  has-patch        |     Focuses:
-----------------------------+-----------------------------
Description changed by dmsnell:

Old description:

> Even though XML cannot represent all possible HTML documents, and even
> though it's dangerous to send XHTML content generally, there are
> extremely rare cases where it's useful to directly embed an HTML document
> into an existing XML document, if the given document //can// be expressed
> in XML.
>
> === What is required to transform when converting HTML to XML? ===
>
>  * HTML void elements like `<img>` should adopt the self-closing flag to
> become `<img />`
>  * HTML text should be decoded and then only `<`, `>`, `&`, `"`, and `'`
> ought to be re-encoded.
>  * Namespace transitions should involve changes to the default namespace.
>     * When entering a foreign element (`SVG` and `MATH`).
>     * When returning to HTML from a foreign element.
>     * When entering HTML integration points, such as `FOREIGNOBJECT` and
> `ANNOTATION-XML` with the proper attribute.
>     * Containing element needs namespace prefix on tag name, e.g.
> `<svg:svg>` or `<svg:foreignElement>`, and then we can update the default
> namespace on that element, but because the default namespace doesn't
> apply to attributes, and because namespaced attributes are different than
> non-namespaced attributes, we must leave the attributes un-namespaced.
>  * Something has to be done about un-representable characters.
>     * Invalid UTF-8 bytes.
>     * Unicode non-characters and other disallowed characters.
>  * HTML documents which cannot be represented in XML should result in
> rejection - cannot serialize.
>  * The HTML doctype declaration probably needs to be removed.
>
> === Design ===
>
> With the introduction of `WP_HTML_Processor::serialize()` in #62036, an
> XML serialization might appear naturally as
> `WP_HTML_Processor::serialize_to_xml()`. When parsing as a fragment, the
> output may be an XML fragment, while a full parser would produce a valid
> XHTML document including the XML declaration.
>
> ----
>
> Please share your thoughts if you know of other transformations that need
> to occur.
>
> ----
>
> XML and HTML are divergent languages. You probably don't want XHTML. It's
> dangerous.

New description:

 Even though XML cannot represent all possible HTML documents, and even
 though it's dangerous to send XHTML content generally, there are extremely
 rare cases where it's useful to directly embed an HTML document into an
 existing XML document, if the given document //can// be expressed in XML.

 === What is required to transform when converting HTML to XML? ===

  * HTML void elements like `<img>` should adopt the self-closing flag to
 become `<img />`
  * HTML text should be decoded and then only `<`, `>`, `&`, `"`, and `'`
 ought to be re-encoded.
  * Namespace transitions should involve changes to the default namespace.
     * When entering a foreign element (`SVG` and `MATH`).
     * When returning to HTML from a foreign element.
     * When entering HTML integration points, such as `FOREIGNOBJECT` and
 `ANNOTATION-XML` with the proper attribute.
     * The containing element needs a namespace prefix on its tag name,
 e.g. `<svg:svg>` or `<svg:foreignObject>`, and then the default namespace
 can be updated on that element. Because the default namespace doesn't
 apply to attributes, and because namespaced attributes are different from
 non-namespaced attributes, the attributes must remain un-namespaced.
  * Something has to be done about un-representable characters.
     * Invalid UTF-8 bytes.
     * Unicode non-characters and other disallowed characters.
  * HTML documents which cannot be represented in XML should be rejected:
 they cannot be serialized.
  * The HTML doctype declaration probably needs to be removed.
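
 The text and character rules in the list above can be sketched in a few
 lines. This is a language-neutral illustration in Python, not WordPress
 code: the function name `html_text_to_xml` is hypothetical, `html.unescape`
 only approximates HTML text decoding, and the character check covers only
 the XML 1.0 `Char` production (other non-characters would need an
 additional policy).

```python
import html
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def html_text_to_xml(raw: bytes) -> str:
    """Hypothetical helper: convert the bytes of an HTML text node to XML text."""
    # Invalid UTF-8 raises UnicodeDecodeError (a ValueError subclass): reject.
    text = raw.decode("utf-8", errors="strict")

    # Decode HTML character references such as &hellip; or &amp;.
    decoded = html.unescape(text)

    # Reject characters that XML 1.0 cannot represent (e.g. form feed).
    if XML_DISALLOWED.search(decoded):
        raise ValueError("text contains a character XML 1.0 cannot represent")

    # Re-encode only the five characters named in the list; "&" goes first
    # so that the references produced below are not double-escaped.
    return (
        decoded.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
        .replace("'", "&apos;")
    )
```

 For example, `html_text_to_xml(b"Fish &amp; chips&hellip; &lt;3")` decodes
 `&hellip;` to a literal `…` and re-encodes only `&` and `<`. Rejecting
 rather than replacing bad input is one possible policy; the ticket leaves
 that choice open.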

 === Design ===

 With the introduction of `WP_HTML_Processor::serialize()` in #62036, an
 XML serialization might appear naturally as
 `WP_HTML_Processor::serialize_to_xml()`. When parsing as a fragment, the
 output may be an XML fragment, while a full parser would produce a valid
 XHTML document including the XML declaration.
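
 As a rough illustration of the fragment versus full-document distinction,
 the following Python sketch serializes an already-parsed token list. It is
 not the HTML API: the token shape, the `full_document` flag, and this
 `serialize_to_xml` function are assumptions made for the example, and
 namespace handling and doctype removal are left out.

```python
from xml.sax.saxutils import escape, quoteattr

# HTML void elements, which get the self-closing flag in XML output.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def serialize_to_xml(tokens, full_document=False):
    """Hypothetical sketch: tokens are ("open", name, attrs),
    ("close", name), or ("text", decoded_text) tuples."""
    out = []
    if full_document:
        # Only a full document carries the XML declaration.
        out.append('<?xml version="1.0" encoding="UTF-8"?>\n')
    for token in tokens:
        kind = token[0]
        if kind == "open":
            _, name, attrs = token
            attr_text = "".join(
                f" {key}={quoteattr(value)}" for key, value in attrs.items()
            )
            closer = " />" if name in VOID_ELEMENTS else ">"
            out.append(f"<{name}{attr_text}{closer}")
        elif kind == "close":
            # Void elements were already closed when they were opened.
            if token[1] not in VOID_ELEMENTS:
                out.append(f"</{token[1]}>")
        elif kind == "text":
            out.append(escape(token[1]))
    return "".join(out)
```

 Called on the tokens for `<p>a < b<br><img src="a.png"></p>` in fragment
 mode, this emits `<br />` and `<img src="a.png" />` and escapes the `<` in
 the text. With `full_document=True` the same output is preceded by the XML
 declaration.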

 ----

 Please share your thoughts if you know of other transformations that need
 to occur.

 ----

 XML and HTML are divergent languages. You probably don't want XHTML. It's
 dangerous.

 == Related work and tickets ==

  - #19998 obviates the need for `esc_xml()`, which is semantically
 ambiguous.
  - #39190 invalid UTF-8 appearing in RSS feed.
  - #49730 Twemoji shouldn’t replace characters in XML documents.


-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62091#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform