[wp-trac] [WordPress Trac] #64473: Embrace WHATWG Encoding Standards

WordPress Trac noreply at wordpress.org
Sun Jan 4 08:19:01 UTC 2026


#64473: Embrace WHATWG Encoding Standards
-------------------------+-----------------------
 Reporter:  dmsnell      |       Owner:  dmsnell
     Type:  enhancement  |      Status:  assigned
 Priority:  normal       |   Milestone:  7.0
Component:  Charset      |     Version:
 Severity:  normal       |  Resolution:
 Keywords:  has-patch    |     Focuses:
-------------------------+-----------------------
Description changed by dmsnell:

Old description:

> Text encoding can be extremely complicated. Worse, it can draw in a wide
> array of security issues. Because of this complexity and because of the
> issues which arise when different systems interpret the same text
> differently, even through such basic actions as using text decoders which
> have different internal behaviors, the WHATWG established the
> [https://encoding.spec.whatwg.org/ Encoding standard].
>
> This specification standardizes many different aspects of the text data
> flow, including, but not limited to:
>  - How can the encoding for a stream of bytes be guessed?
>  - When someone says their text is “1252” or “UTF7” or “UTF-8;ASCII” or
> any number of invalid or non-standard declarations, what should the
> system pick as the correct encoding declaration?
>  - How should certain security-sensitive encodings be handled?
>  - How exactly should certain kinds of errors be handled when decoding
> multibyte characters?
>
> It also strongly asserts that all systems should ideally use UTF-8 (see
> #62172).
>
> ----
>
> The specification is rather short and would provide considerable value to
> the tricky parts of WordPress’ encoding woes.
>
> It should be designed in a way to answer questions that developers have
> when using WordPress, touching notable parts such as:
>
>  - Parsing HTML when an encoding is uncertain or unknown.
>  - Converting text from the database to HTML.
>  - Converting text when exporting to WXR.
>  - Converting text when existing decoders aren’t available (polyfilling
> conversion).
>  - Providing security-sensitive aids to text-handling code.
>
> == Related Tickets
>
>  - #7813, #38479, #39190 export functions need reliable conversion from a
> likely-unknown legacy encoding into UTF-8 (and //not// `utf8_encode()` —
> see #55603).
>  - #20368 `htmlspecialchars()` woes when charset not provided. a separate
> issue, but the proposed patch includes a simplified form of a
> `name_from_label` table.
>  - #49355 seems like posts can fail to save into a database when supplied
> invalid encodings. this is a bigger issue requiring coordination with
> `wpdb`

New description:

 Text encoding can be extremely complicated. Worse, it can draw in a wide
 array of security issues. Because of this complexity and because of the
 issues which arise when different systems interpret the same text
 differently, even through such basic actions as using text decoders which
 have different internal behaviors, the WHATWG established the
 [https://encoding.spec.whatwg.org/ Encoding standard].

 This specification standardizes many different aspects of the text data
 flow, including, but not limited to:
  - How can the encoding for a stream of bytes be guessed?
  - When someone says their text is “1252” or “UTF7” or “UTF-8;ASCII” or
 any number of invalid or non-standard declarations, what should the system
 pick as the correct encoding declaration?
  - How should certain security-sensitive encodings be handled?
  - How exactly should certain kinds of errors be handled when decoding
 multibyte characters?
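
 As a rough illustration of the second question above, the standard's "get
 an encoding" step trims ASCII whitespace from a label, compares it ASCII
 case-insensitively against a fixed label table, and reports failure when
 nothing matches. The sketch below is Python rather than WordPress code. Its
 table is a small, abridged subset of the real one, and the function name
 `name_from_label` is borrowed from the table name mentioned under Related
 Tickets, not taken from any existing API.

```python
import string
from typing import Optional

# ASCII whitespace as the standard defines it: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Lowercase only A-Z. str.lower() is Unicode-aware and would also fold
# non-ASCII look-alikes (e.g. U+212A KELVIN SIGN) into ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Abridged, illustrative subset of the label table (assumption: the real
# table in the Encoding standard is much longer).
LABEL_TO_NAME = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "windows-1252": "windows-1252",
    "cp1252": "windows-1252",
    "iso-8859-1": "windows-1252",
    "latin1": "windows-1252",
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
}


def name_from_label(label: str) -> Optional[str]:
    """Return the canonical encoding name for a label, or None on failure."""
    key = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABEL_TO_NAME.get(key)


print(name_from_label("  Latin1\n"))    # a known label, despite case/whitespace
print(name_from_label("1252"))          # not a label: lookup fails
print(name_from_label("UTF-8;ASCII"))   # not a label: lookup fails
```

 Under this lookup, the declarations quoted above (“1252”, “UTF7”,
 “UTF-8;ASCII”) all fail rather than being guessed at, while legacy labels
 such as `latin1` resolve to `windows-1252`.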

 It also strongly asserts that all systems should ideally use UTF-8 (see
 #62172).

 ----

 The specification is rather short, and adopting it would go a long way
 toward untangling the tricky parts of WordPress’ encoding woes.

 The implementation should be designed to answer the questions that
 developers have when using WordPress, touching notable areas such as:

  - Parsing HTML when an encoding is uncertain or unknown.
  - Converting text from the database to HTML.
  - Converting text when exporting to WXR.
  - Converting text when existing decoders aren’t available (polyfilling
 conversion).
  - Providing security-sensitive aids to text-handling code.
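
 For the first item above, one small piece of the standard is "BOM
 sniffing": a byte order mark at the start of the input takes precedence
 over any declared encoding. The following is a minimal Python sketch of
 that step only. The helper `choose_encoding`, its parameters, and its UTF-8
 fallback are assumptions made for illustration and are not part of
 WordPress or of the standard's text.

```python
from typing import Optional


def bom_sniff(data: bytes) -> Optional[str]:
    """Return the encoding implied by a leading byte order mark, if any."""
    if data[:3] == b"\xef\xbb\xbf":
        return "UTF-8"
    if data[:2] == b"\xfe\xff":
        return "UTF-16BE"
    if data[:2] == b"\xff\xfe":
        return "UTF-16LE"
    return None


def choose_encoding(data: bytes,
                    declared: Optional[str] = None,
                    fallback: str = "UTF-8") -> str:
    """Hypothetical helper: a BOM wins, then the declared name, then a fallback."""
    return bom_sniff(data) or declared or fallback


print(choose_encoding(b"\xef\xbb\xbf<p>hi</p>", declared="windows-1252"))
print(choose_encoding(b"<p>hi</p>", declared="windows-1252"))
```

 A full implementation would also need the label resolution and the
 HTML-specific prescan for `<meta>` declarations, which this sketch leaves
 out.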

 == Related Tickets

  - #7813, #38479, #39190 export functions need reliable conversion from a
 likely-unknown legacy encoding into UTF-8 (and //not// `utf8_encode()` —
 see #55603).
  - #20368 `htmlspecialchars()` woes when a charset is not provided. This is
 a separate issue, but the proposed patch includes a simplified form of a
 `name_from_label` table.
  - #49355 Posts apparently can fail to save to the database when supplied
 with invalid encodings. This is a bigger issue requiring coordination with
 `wpdb`.
  - #63864 MIME decoding from email should be cautious about what it
 decodes.
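
 To make the `utf8_encode()` caveat in the first item concrete: that PHP
 function always treats its input as ISO-8859-1, whereas the Encoding
 standard maps labels such as `latin1` and `iso-8859-1` to windows-1252. The
 two differ in the 0x80–0x9F range, where windows-1252 places characters
 such as curly quotes. The Python sketch below only mimics the two
 interpretations and is not WordPress code. The sample bytes are made up,
 and Python's `cp1252` codec rejects a few bytes (for example 0x81) that the
 standard's windows-1252 decoder accepts.

```python
# Hypothetical legacy bytes: a word wrapped in windows-1252 curly quotes.
raw = b"\x93quoted\x94"

# What an ISO-8859-1 -> UTF-8 conversion (the behavior of utf8_encode())
# produces: 0x93 and 0x94 become the C1 control characters U+0093 and U+0094.
as_utf8_encode = raw.decode("latin-1").encode("utf-8")

# What a windows-1252 -> UTF-8 conversion produces: 0x93 and 0x94 become
# U+201C and U+201D, the left and right double quotation marks.
as_windows_1252 = raw.decode("cp1252").encode("utf-8")

print(as_utf8_encode)    # control characters, invisible or garbled on display
print(as_windows_1252)   # the intended curly quotes, encoded as UTF-8
```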

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/64473#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform