[wp-trac] [WordPress Trac] #62172: Deprecate non-UTF-8 Support

Sat Oct 5 19:18:14 UTC 2024

#62172: Deprecate non-UTF-8 Support
-------------------------+-----------------------------
 Reporter:  dmsnell      |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  Future Release
Component:  General      |     Version:  trunk
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:
-------------------------+-----------------------------

Old description:

> WordPress' code and history is full of ambiguity on character encoding.
> When WordPress was formed, many websites and systems still used various
> single-byte region-specific text encodings, and some used more
> complicated shifting encodings, but today, UTF-8 is near-universal and
> the standard recommendation for interoperability between systems.
>
> Significant complexity in WordPress codebase exists in an attempt to
> properly handle various character encodings. Unfortunately, in many (if
> not most) of these cases, the code is confused on what strings are in
> what encodings and how those need to be transformed in order to make
> proper sense of it all.
>
> Furthermore, the `blog_charset` appears to have been introduced for the
> purpose of writing a `<meta>` tag on rendered pages to let a browser know
> what encoding to expect, while WordPress itself was to remain agnostic
> with regards to that same encoding. Over time, the option has been used
> as a mechanism for indicating how to transform strings, which doesn't
> resolve any of the problems introduced by working with multiple character
> encodings. (thanks @mdawaffe for the history there).
>
> In any given WordPress request:
>  - data in the database is stored in one of two ways: either as encoded
> text in some character encoding, or as the raw bytes of some encoding
> which are mislabeled as `latin1` so that MySQL doesn't attempt to
> interpret the bytes.
>  - data is read from MySQL and possibly transformed from the stored bytes
> into a connection/session-determined encoding and collation, unless a
> query-specified encoding is also provided.
>  - PHP source code is stored as UTF-8 or is US-ASCII compatible, making
> string-based operations against possibly-transformed data from the
> database.
>  - Various PHP code will read the currently-set locale or
> `default_charset`, `input_encoding`, `output_encoding`, or
> `internal_encoding` and operate differently because of an assumption that
> the bytes on which they are operating is in those other encodings.
>  - Files are read from the filesystem which are probably encoded in
> UTF-8.
>  - Query args are parsed and percent-escaping is decoded, whose source
> encoding is not guaranteed to be UTF-8.
>  - POST arguments are read, parsed, and percent-decoded, again without
> clarity on which byte encoding they are escaping.
>  - HTML named character references are encoded and decoded, which
> translate into different byte sequences based on the configured character
> encoding, often set by `blog_charset`.
>  - Various filters and functions in Core, like `wp_spaces_regex()`
> examine specific byte sequences, which are UTF-8-specific, against
> strings which may have the same character sequence but in a different
> byte sequence.
>  - Network requests might be made, which are read and parsed, which may
> come in different encodings according to the `Content-type`.
>  - HTML is sent to the browser and a `<meta charset="">` tag is produced
> to instruct the browser how to interpret the bytes it receives. This may
> or may not match the HTML which WordPress is generating, as most block
> code and most filters are hard-coded PHP strings in UTF-8 or are at least
> isomorphic to it up to US-ASCII.
>
> So as is the case with deprecating XHTML and HTML4 support, deprecating
> UTF-8 is mostly about being honest with ourselves and making space
> officially to remove complex and risky parts of the codebase that often
> do more harm and help. There's a good chance today that WordPress is
> already extremely fragile when working with non-UTF-8 systems, and
> deprecating it would make it possible to fix those existing issues.
>
> Deprecating UTF-8 means WordPress can stop attempting to support an
> N-to-M text-encoding architecture and replace it with an N-to-1
> architecture, where strings that need to be converted are converted at
> the boundary of the system while everything inside the system is UTF-8,
> harmonizing all of the different levels of encoding and code.

New description:

 WordPress' code and history are full of ambiguity about character encoding.
 When WordPress was formed, many websites and systems still used various
 single-byte region-specific text encodings, and some used more complicated
 shifting encodings, but today, UTF-8 is near-universal and the standard
 recommendation for interoperability between systems.

 Significant complexity in the WordPress codebase exists in an attempt to
 properly handle various character encodings. Unfortunately, in many (if
 not most) of these cases, the code is confused about which strings are in
 which encodings and how those need to be transformed in order to make
 proper sense of it all.

 Furthermore, the `blog_charset` appears to have been introduced for the
 purpose of writing a `<meta>` tag on rendered pages to let a browser know
 what encoding to expect, while WordPress itself was to remain agnostic
 with regards to that same encoding. Over time, the option has been used as
 a mechanism for indicating how to transform strings, which doesn't resolve
 any of the problems introduced by working with multiple character
 encodings (thanks @mdawaffe for the history there).

 In any given WordPress request:
  - data in the database is stored in one of two ways: either as encoded
 text in some character encoding, or as the raw bytes of some encoding
 which are mislabeled as `latin1` so that MySQL doesn't attempt to
 interpret the bytes.
  - data is read from MySQL and possibly transformed from the stored bytes
 into a connection/session-determined encoding and collation, unless a
 query-specified encoding is also provided.
  - PHP source code is stored as UTF-8 or is US-ASCII compatible, making
 string-based operations against possibly-transformed data from the
 database.
  - Various PHP code will read the currently-set locale or
 `default_charset`, `input_encoding`, `output_encoding`, or
 `internal_encoding` and operate differently because of an assumption that
 the bytes on which they are operating are in those other encodings.
  - Files are read from the filesystem which are probably encoded in UTF-8.
  - Query args are parsed and percent-escaping is decoded, whose source
 encoding is not guaranteed to be UTF-8.
  - POST arguments are read, parsed, and percent-decoded, again without
 clarity on which byte encoding they are escaping.
  - HTML named character references are encoded and decoded, which
 translate into different byte sequences based on the configured character
 encoding, often set by `blog_charset`.
  - Various filters and functions in Core, like `wp_spaces_regex()`,
 examine specific byte sequences, which are UTF-8-specific, against strings
 which may have the same character sequence but in a different byte
 sequence.
  - Network requests might be made, which are read and parsed, which may
 come in different encodings according to the `Content-type`.
  - HTML is sent to the browser and a `<meta charset="">` tag is produced
 to instruct the browser how to interpret the bytes it receives. This may
 or may not match the HTML which WordPress is generating, as most block
 code and most filters are hard-coded PHP strings in UTF-8 or are at least
 isomorphic to it up to US-ASCII.
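
 To make two of the points above concrete, here is a minimal sketch in
 plain Python (used only for illustration; none of this is WordPress code,
 and the helper name `contains_utf8_nbsp` is hypothetical). It shows how a
 byte-level check that hard-codes a UTF-8 sequence misses the same
 character in another encoding, and how percent-decoding produces bytes
 whose meaning depends on an encoding the URL itself does not state. The
 no-break space and the `caf%E9` input are example values chosen for the
 sketch.

```python
# Illustration only: plain Python, not WordPress code.
from urllib.parse import unquote_to_bytes

NBSP = "\u00a0"  # NO-BREAK SPACE, chosen as an example character

# The same character has different byte sequences in different encodings.
utf8_bytes = NBSP.encode("utf-8")      # two bytes: C2 A0
latin1_bytes = NBSP.encode("latin-1")  # one byte: A0


def contains_utf8_nbsp(raw: bytes) -> bool:
    """Hypothetical byte-level check that hard-codes the UTF-8 form."""
    return b"\xc2\xa0" in raw


text = "a" + NBSP + "b"
found_in_utf8 = contains_utf8_nbsp(text.encode("utf-8"))
found_in_latin1 = contains_utf8_nbsp(text.encode("latin-1"))

# Percent-decoding yields bytes; which characters they represent depends
# on an encoding that the query string does not carry.
raw = unquote_to_bytes("caf%E9")
as_latin1 = raw.decode("latin-1")
as_utf8 = raw.decode("utf-8", errors="replace")  # E9 alone is invalid UTF-8
```

 In this sketch the check succeeds on the UTF-8 bytes and fails on the
 Latin-1 bytes even though both hold the same character, and the single
 percent-decoded byte only becomes "é" if the reader happens to guess
 Latin-1.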

 So as is the case with deprecating XHTML and HTML4 support, deprecating
 non-UTF-8 support is mostly about being honest with ourselves and making
 space officially to remove complex and risky parts of the codebase that
 often do more harm than good. There's a good chance today that WordPress
 is already extremely fragile when working with non-UTF-8 systems, and
 deprecating that support would make it possible to fix those existing
 issues.

 Deprecating non-UTF-8 support means WordPress can stop attempting to
 support an N-to-M text-encoding architecture and replace it with an N-to-1
 architecture, where strings that need to be converted are converted at the
 boundary of the system while everything inside the system is UTF-8,
 harmonizing all of the different levels of encoding and code.
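
 The N-to-1 idea can be sketched roughly as follows, again in plain Python
 for illustration rather than as a proposed implementation. The function
 names `to_internal`, `from_internal`, and `shout` are hypothetical, and
 the sketch assumes the source charset of incoming bytes is known at the
 boundary, which is the hard part in practice.

```python
# Illustration only: a boundary-conversion sketch, not WordPress code.

def to_internal(raw: bytes, source_charset: str) -> bytes:
    """Inbound boundary: convert bytes in a declared charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def from_internal(utf8: bytes, target_charset: str) -> bytes:
    """Outbound boundary: convert UTF-8 to the charset a consumer needs.

    Characters the target cannot represent become numeric references.
    """
    text = utf8.decode("utf-8")
    return text.encode(target_charset, errors="xmlcharrefreplace")


def shout(utf8: bytes) -> bytes:
    """Internal code: may assume UTF-8 and never asks about charsets."""
    return utf8.decode("utf-8").upper().encode("utf-8")


legacy_input = "h\u00e9llo".encode("iso-8859-1")    # arrives as Latin-1
internal = to_internal(legacy_input, "iso-8859-1")  # converted once
result = shout(internal)                            # UTF-8 in, UTF-8 out
legacy_output = from_internal(result, "iso-8859-1") # converted once
```

 Only the two boundary functions know about other encodings; everything
 between them works on a single, known encoding.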

--

Comment (by dmsnell):

 Updated to fix the inverted deprecation (let's go back to US-ASCII-only
 🙃), and thanks @mdawaffe!

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62172#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform