[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9

Sat Aug 23 00:34:26 UTC 2025

#63863: Standardize UTF-8 handling and fallbacks in 6.9
-------------------------+---------------------
 Reporter:  dmsnell      |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  6.9
Component:  Formatting   |     Version:  trunk
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:
-------------------------+---------------------
Description changed by dmsnell:

Old description:

> Core uses a wide variety of methods for working with UTF-8, especially so
> when the running server and process doesn’t have the `mbstring` extension
> loaded. Much of the diversity in implementation and specification in this
> code is the legacy of the time and context in which it was added. Today,
> it’s viable to unify and standardize code handling UTF-8, which is what
> this ticket proposes.
>
> **WordPress should unify handling of UTF-8 data** and depend foremost on
> the `mbstring`-provided functions which have received significant updates
> during the 8.x releases of PHP. When unavailable, WordPress should defer
> to a single and reasonable implementation of the functionality in pure
> PHP.
>
> === Why `mbstring`?
>
> The `mbstring` PHP extension has been
> [https://make.wordpress.org/hosting/handbook/server-environment/#php-
> extensions highly recommended] for a long time now and it provides high-
> performance and spec-compliant tools for working with UTF-8 and other
> text encodings. The extension has seen active development and integration
> with PHP throughout the PHP 8.x cycle, including improved Unicode support
> and a couple of notable enhancements for this work:
>
>  - Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
> [https://www.unicode.org/versions/Unicode16.0.0/core-
> spec/chapter-3/#G66453 substitution of maximal subparts]. This is an
> important aspect for UTF-8 decoders which ensures safe decoding of
> untrusted inputs //and// standardized interoperability with UTF-8
> handling in other systems.
>  - Since PHP 8.3.0 the UTF-8 validity of a string [https://github.com/php
> /php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the
> ZSTR value itself] and repeated checks are property-accesses for strings
> validated as UTF-8.
>  - These functions use high-performance implementations in C including
> the use of SIMD and vectorized algorithms whose speed cannot be matched
> in PHP and which aren’t deployed in every other potential method for
> working with UTF-8 (for example, in the `iconv` and `PCRE` libraries).
>
> The case for `mbstring` is strong. On the contrary side I believe that
> there are reasons to //avoid// other methods:
>  - `iconv()`’s behavior depends on the version of `iconv()` compiled in
> to the running PHP version and it’s hard to establish uniform behavior in
> the PHP code.
>  - `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
> only supports an `//IGNORE` mode which removes byte sequences and this is
> not a sound security practice. It’s far more dangerous than substituting
> the sequences as is done with `mb_scrub()`.
>  - PCRE-based approaches are also highly system-dependent.
>  - Older versions of the PCRE libraries accepted certain invalid UTF-8
> sequences as valid, e.g. with five-byte sequences.
>  - PCRE-based approaches are memory heavy, most operating through one of
> two approaches: split the string into an array of chunks and then process
> the chunks; or use `preg_replace_callback()` and pass around sub-
> allocations of the input string.
>  - PCRE functions give no method for zero-allocation operation. This
> makes PCRE-based approaches vulnerable to denial-of-service attacks and
> accidents because they must match and extract a full group to operate on
> it. (For example, HTML numeric character references may contain an
> arbitrary number of leading zeros and can choke `preg_match()`, doubling
> memory use or more just to make the match.)
>
> Finally, the plethora of approaches in Core can be dizzying when trying
> to understand behaviors and bugs. Without a well-defined specification of
> the behaviors, things only //seem// to work because the code does not
> explain the bounds of operation.
>
> === Why a single fallback in pure PHP?
>
> While a pure user-space PHP implementation of UTF-8 handling will never
> be able to approach the performance and efficiency of underlying native
> libraries, it can be reasonably fast and //fast enough// for reasonable
> use. Reportedly, around 0.5% of WordPress installations lack the
> `mbstring` extension. For those sites which lack the extension, WordPress
> can make a tradeoff between performance and harmony of behaviors.
>
> When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
> `PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII)
> then it breaks apart WordPress’ reliability and makes for very difficult
> debugging and resolution.
>
> A pure PHP fallback gives WordPress the ability to identify and fix bugs
> in its implementation and the implementation details are visible for all
> to inspect. There’s a much higher barrier to try and diagnose why the
> functions return the wrong result when one needs to scan PHP’s source
> code, `iconv`’s source code, one of several PCRE libraries’ source code,
> and more.
>
> ----
>
> While the up-front effort is high, the HTML API has demonstrated how
> valuable it can be to have a reliable API in WordPress for interoperating
> with various web standards. It’s time to modernize WordPress to support
> UTF-8 universally and remove the existing complexity of ad-hoc handling
> and runtime dependencies.
>
> == Proposal
>
>  - Create a new `wp-includes/compat-utf8.php` polyfilling basic UTF-8
> handling and implementing the current `mb_` polyfills from `compat.php`.
> Moving this to a UTF-8-specific module keeps the code in that module
> focused and makes it easier to exclude the WPCS rule for rejecting `goto`
> statements. `goto` is a valuable construct when handling decoding errors
> in a low-level decoder. This module loads //before//
> `wp-includes/compat.php` which makes it simpler in that file to polyfill
> things like `mb_substr()`.
>  - Create `wp-includes/utf8.php` containing WordPress-specific functions
> for handling UTF-8. This abstracts access to text behind a unifying
> interface and allows WordPress to improve support, performance, and
> reliability while lifting up all calling code. UTF-8 is universal enough
> to warrant its own subsystem.
>  - String functions are conditionally-defined based on the presence of
> the `mbstring` extension and any other relevant factors. This moves
> support-checks to PHP initialization instead of on every invocation of
> these functions.
>  - A new UTF-8 decoding pipeline provides zero-allocation, streaming, and
> re-entrant access to a string so that common operations don’t need to
> involve any more overhead than they require. In addition to being a
> versatile fallback mechanism, this low-level scanner can provide access
> to new abilities not available today such as: //count code points within
> a substring without allocating//, //split a string into chunks of valid
> and invalid byte sequences//, and //combine identification, validation,
> and transformation of a string into a single pass//.
>  - Replace existing non-canonical UTF-8 code in Core with the new
> abstractions. No more `static $utf8_pcre` checks, no more `if (
> function_exists( 'mb_substr' ) )` — just unconditional explicit
> semantics.
>  - Remove the single regex from the HTML API.
>
> As WordPress builds its own abstraction and polyfills for the `mbstring`
> library it can remove the fallback behaviors as it changes its minimum
> supported versions for PHP and if it starts requiring `mbstring`.
>
> == Related tickets
>
>  - #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
>  - #62172: Deprecate non-UTF-8 support.
>  - #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
> dependencies.
>    - #29717: Optimize and fix `wp_check_invalid_utf8()`.
>    - #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
>  - #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
> discussion of requiring `mbstring`.
>
> == Related PRs
>
>  - [https://github.com/WordPress/wordpress-develop/pull/6883 #6883]
> introduce custom UTF-8 decoding pipeline. (this PR was exploratory as
> part of background research).
>  - [https://github.com/WordPress/wordpress-develop/pull/9498 #9498]
> update `wp_check_invalid_utf8()` (currently contains broader updates
> which will be removed and transferred into a new PR).

New description:

 Core uses a wide variety of methods for working with UTF-8, especially so
 when the running server and process doesn’t have the `mbstring` extension
 loaded. Much of the diversity in implementation and specification in this
 code is the legacy of the time and context in which it was added. Today,
 it’s viable to unify and standardize code handling UTF-8, which is what
 this ticket proposes.

 **WordPress should unify handling of UTF-8 data** and depend foremost on
 the `mbstring`-provided functions which have received significant updates
 during the 8.x releases of PHP. When unavailable, WordPress should defer
 to a single and reasonable implementation of the functionality in pure
 PHP.

 === Why `mbstring`?

 The `mbstring` PHP extension has been
 [https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions highly recommended]
 for a long time now and it provides high-performance and spec-compliant
 tools for working with UTF-8 and other text encodings. The extension has
 seen active development and integration with PHP throughout the PHP 8.x
 cycle, including improved Unicode support and several notable
 enhancements relevant to this work:

  - Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
 [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 substitution of maximal subparts].
 This is an important aspect for UTF-8 decoders which ensures safe decoding
 of untrusted inputs //and// standardized interoperability with UTF-8
 handling in other systems.
  - Since PHP 8.3.0 the UTF-8 validity of a string
 [https://github.com/php/php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the ZSTR value itself]
 and repeated checks are property-accesses for strings validated as UTF-8.
  - These functions use high-performance implementations in C including the
 use of SIMD and vectorized algorithms whose speed cannot be matched in PHP
 and which aren’t deployed in every other potential method for working with
 UTF-8 (for example, in the `iconv` and `PCRE` libraries).
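 As a rough illustration of the `mbstring` path described above, the
 following sketch validates a string and then substitutes invalid bytes
 rather than removing them. The function name `sketch_scrub_utf8()` is
 hypothetical and not part of Core; it assumes the `mbstring` extension is
 loaded and chooses U+FFFD as the substitute character for the duration of
 the call.

```php
<?php
/**
 * Hypothetical helper: returns the input unchanged when it is valid UTF-8,
 * otherwise replaces invalid byte sequences with U+FFFD.
 */
function sketch_scrub_utf8( string $bytes ): string {
	// Fast path: nothing to do for already-valid input.
	if ( mb_check_encoding( $bytes, 'UTF-8' ) ) {
		return $bytes;
	}

	// Remember the current substitute character so it can be restored.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );

	$scrubbed = mb_scrub( $bytes, 'UTF-8' );

	mb_substitute_character( $previous );
	return $scrubbed;
}

// "\xFF" can never appear in well-formed UTF-8.
$scrubbed = sketch_scrub_utf8( "caf\xC3\xA9 \xFF ok" );
```

 Substituting keeps a visible marker where the invalid bytes were, so text
 on either side of them is never silently joined together.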

 The case for `mbstring` is strong. Conversely, I believe that there are
 reasons to //avoid// other methods:
  - `iconv()`’s behavior depends on the version of `iconv` compiled into
 the running PHP build and it’s hard to establish uniform behavior in the
 PHP code.
  - `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
 only supports an `//IGNORE` mode which removes byte sequences and this is
 not a sound security practice. It’s far more dangerous than substituting
 the sequences as is done with `mb_scrub()`.
  - PCRE-based approaches are also highly system-dependent.
  - Older versions of the PCRE libraries accepted certain invalid UTF-8
 sequences as valid, e.g. with five-byte sequences.
  - PCRE-based approaches are memory heavy, most operating through one of
 two approaches: split the string into an array of chunks and then process
 the chunks; or use `preg_replace_callback()` and pass around sub-
 allocations of the input string.
  - PCRE functions give no method for zero-allocation operation. This makes
 PCRE-based approaches vulnerable to denial-of-service attacks and
 accidents because they must match and extract a full group to operate on
 it. (For example, HTML numeric character references may contain an
 arbitrary number of leading zeros and can choke `preg_match()`, doubling
 memory use or more just to make the match.)
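 To make the leading-zeros concern concrete, here is a minimal sketch of
 scanning a decimal numeric character reference with `strspn()` instead of
 a regular expression. The function `sketch_decode_decimal_reference()` is
 hypothetical; it handles only the decimal form, does not check for the
 trailing semicolon, and is meant only to show that the run of zeros can be
 skipped by offset without being copied out of the input.

```php
<?php
/**
 * Hypothetical helper: reads a decimal reference such as "&#0065" starting
 * at byte offset $at and returns its numeric value, or null if none is found.
 */
function sketch_decode_decimal_reference( string $html, int $at ): ?int {
	if ( $at < 0 || $at + 2 > strlen( $html ) || '&' !== $html[ $at ] || '#' !== $html[ $at + 1 ] ) {
		return null;
	}

	$digits_at = $at + 2;

	// Measure the zeros in place; only their count is needed.
	$zeros  = strspn( $html, '0', $digits_at );
	$digits = strspn( $html, '0123456789', $digits_at + $zeros );

	if ( 0 === $zeros + $digits ) {
		return null;
	}

	// U+10FFFF is 1114111, so more than seven significant digits is out of range.
	if ( $digits > 7 ) {
		return null;
	}

	// Only the few significant digits are extracted from the input.
	return (int) substr( $html, $digits_at + $zeros, $digits );
}

$padded = '&#' . str_repeat( '0', 100000 ) . '65;';
$value  = sketch_decode_decimal_reference( $padded, 0 );
```

 However long the padding is, the only substring created here is the short
 run of significant digits.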

 Finally, the plethora of approaches in Core can be dizzying when trying to
 understand behaviors and bugs. Without a well-defined specification of the
 behaviors, things only //seem// to work because the code does not explain
 the bounds of operation.

 === Why a single fallback in pure PHP?

 While a pure user-space PHP implementation of UTF-8 handling will never be
 able to approach the performance and efficiency of underlying native
 libraries, it can be reasonably fast and //fast enough// for reasonable
 use. Reportedly, around 0.5% of WordPress installations lack the
 `mbstring` extension. For those sites which lack the extension, WordPress
 can make a tradeoff between performance and harmony of behaviors.

 When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
 `PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII)
 the result can differ from one server to the next, which undermines
 WordPress’ reliability and makes debugging and resolution very difficult.

 A pure PHP fallback gives WordPress the ability to identify and fix bugs
 in its implementation and the implementation details are visible for all
 to inspect. There’s a much higher barrier to try and diagnose why the
 functions return the wrong result when one needs to scan PHP’s source
 code, `iconv`’s source code, one of several PCRE libraries’ source code,
 and more.
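 For a sense of scale, a pure-PHP validity check can be quite small. The
 sketch below follows the well-formed byte ranges from the Unicode
 standard (rejecting overlong forms, surrogates, and values above
 U+10FFFF). The name `sketch_is_valid_utf8()` is hypothetical and this is
 not the implementation proposed in the linked PRs.

```php
<?php
/**
 * Hypothetical pure-PHP check for well-formed UTF-8.
 */
function sketch_is_valid_utf8( string $bytes ): bool {
	$at  = 0;
	$end = strlen( $bytes );

	while ( $at < $end ) {
		$b0 = ord( $bytes[ $at ] );

		// ASCII bytes stand alone.
		if ( $b0 < 0x80 ) {
			++$at;
			continue;
		}

		// Pick the number of continuation bytes and the allowed
		// range for the first continuation byte.
		if ( $b0 >= 0xC2 && $b0 <= 0xDF ) {
			$need = 1; $low = 0x80; $high = 0xBF;
		} elseif ( 0xE0 === $b0 ) {
			$need = 2; $low = 0xA0; $high = 0xBF; // No overlong forms.
		} elseif ( 0xED === $b0 ) {
			$need = 2; $low = 0x80; $high = 0x9F; // No surrogates.
		} elseif ( $b0 >= 0xE1 && $b0 <= 0xEF ) {
			$need = 2; $low = 0x80; $high = 0xBF;
		} elseif ( 0xF0 === $b0 ) {
			$need = 3; $low = 0x90; $high = 0xBF; // No overlong forms.
		} elseif ( 0xF4 === $b0 ) {
			$need = 3; $low = 0x80; $high = 0x8F; // Nothing above U+10FFFF.
		} elseif ( $b0 >= 0xF1 && $b0 <= 0xF3 ) {
			$need = 3; $low = 0x80; $high = 0xBF;
		} else {
			return false; // Stray continuation byte or invalid lead byte.
		}

		// The sequence must not run past the end of the input.
		if ( $at + $need >= $end ) {
			return false;
		}

		$b1 = ord( $bytes[ $at + 1 ] );
		if ( $b1 < $low || $b1 > $high ) {
			return false;
		}

		// Any remaining continuation bytes use the plain 0x80-0xBF range.
		for ( $i = 2; $i <= $need; $i++ ) {
			$b = ord( $bytes[ $at + $i ] );
			if ( $b < 0x80 || $b > 0xBF ) {
				return false;
			}
		}

		$at += 1 + $need;
	}

	return true;
}
```

 Because every rule is spelled out in the function body, a wrong answer can
 be traced to a specific branch without reading the source of any
 extension or library.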

 ----

 While the up-front effort is high, the HTML API has demonstrated how
 valuable it can be to have a reliable API in WordPress for interoperating
 with various web standards. It’s time to modernize WordPress to support
 UTF-8 universally and remove the existing complexity of ad-hoc handling
 and runtime dependencies.

 == Proposal

  - Create a new `wp-includes/compat-utf8.php` polyfilling basic UTF-8
 handling and implementing the current `mb_` polyfills from `compat.php`.
 Moving this to a UTF-8-specific module keeps the code in that module
 focused and makes it easier to exclude the WPCS rule for rejecting `goto`
 statements. `goto` is a valuable construct when handling decoding errors
 in a low-level decoder. This module loads //before//
 `wp-includes/compat.php` which makes it simpler in that file to polyfill
 things like `mb_substr()`.
  - Create `wp-includes/utf8.php` containing WordPress-specific functions
 for handling UTF-8. This abstracts access to text behind a unifying
 interface and allows WordPress to improve support, performance, and
 reliability while lifting up all calling code. UTF-8 is universal enough
 to warrant its own subsystem.
  - String functions are conditionally-defined based on the presence of the
 `mbstring` extension and any other relevant factors. This moves support-
 checks to PHP initialization instead of on every invocation of these
 functions. A side-effect of splitting these functions based on the
 presence of the extension is the safe removal of
 `mbstring_binary_safe_encoding()`. When `mbstring` is loaded, functions
 will call the `mb_` functions directly; when it’s not available, there can
 be no `mbstring.func_overload`.
  - A new UTF-8 decoding pipeline provides zero-allocation, streaming, and
 re-entrant access to a string so that common operations don’t need to
 involve any more overhead than they require. In addition to being a
 versatile fallback mechanism, this low-level scanner can provide access to
 new abilities not available today such as: //count code points within a
 substring without allocating//, //split a string into chunks of valid and
 invalid byte sequences//, and //combine identification, validation, and
 transformation of a string into a single pass//.
  - Replace existing non-canonical UTF-8 code in Core with the new
 abstractions. No more `static $utf8_pcre` checks, no more `if (
 function_exists( 'mb_substr' ) )` — just unconditional explicit semantics.
  - Remove the single regex from the HTML API.
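
 To illustrate the kind of access the decoding pipeline bullet describes,
 here is a much-simplified cursor-based scanner. Both function names are
 hypothetical, and the sketch assumes the input has already been validated
 as UTF-8 and that offsets fall on character boundaries; the pipeline
 proposed in this ticket would also need to handle invalid bytes.

```php
<?php
/**
 * Hypothetical scanner step: decodes the code point at byte offset $at and
 * advances $at past it. Returns null at the end of the input.
 * Assumes $text is valid UTF-8.
 */
function sketch_next_code_point( string $text, int &$at ): ?int {
	if ( $at >= strlen( $text ) ) {
		return null;
	}

	$b0 = ord( $text[ $at ] );
	if ( $b0 < 0x80 ) {
		++$at;
		return $b0;
	}

	// The lead byte determines the sequence length and payload bits.
	if ( $b0 < 0xE0 ) {
		$length = 2; $code_point = $b0 & 0x1F;
	} elseif ( $b0 < 0xF0 ) {
		$length = 3; $code_point = $b0 & 0x0F;
	} else {
		$length = 4; $code_point = $b0 & 0x07;
	}

	// Each continuation byte contributes six more bits.
	for ( $i = 1; $i < $length; $i++ ) {
		$code_point = ( $code_point << 6 ) | ( ord( $text[ $at + $i ] ) & 0x3F );
	}

	$at += $length;
	return $code_point;
}

/**
 * Hypothetical helper: counts code points inside a byte range by moving
 * the cursor, without calling substr() on the input.
 */
function sketch_count_code_points_in_range( string $text, int $start, int $byte_length ): int {
	$at    = $start;
	$end   = min( strlen( $text ), $start + $byte_length );
	$count = 0;

	while ( $at < $end && null !== sketch_next_code_point( $text, $at ) ) {
		++$count;
	}

	return $count;
}
```

 Since all state lives in the caller's `$at` cursor, a scan can be paused
 and resumed, or several scans can run over the same string at once.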

 As WordPress builds its own abstraction and polyfills for the `mbstring`
 library it can remove the fallback behaviors as it changes its minimum
 supported versions for PHP and if it starts requiring `mbstring`.
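
 A minimal sketch of the conditional definitions mentioned in the proposal
 might look like the following. The function name
 `sketch_utf8_codepoint_count()` is hypothetical, and the fallback branch
 assumes its input has already been validated as UTF-8. The support check
 runs once, when the file is loaded, and the `else` branch is the part
 that could be deleted if `mbstring` ever becomes a requirement.

```php
<?php
if ( extension_loaded( 'mbstring' ) ) {
	/**
	 * Counts code points using the native extension.
	 */
	function sketch_utf8_codepoint_count( string $text ): int {
		return mb_strlen( $text, 'UTF-8' );
	}
} else {
	/**
	 * Pure-PHP fallback: in valid UTF-8 every code point has exactly one
	 * byte that is not a continuation byte (10xxxxxx).
	 */
	function sketch_utf8_codepoint_count( string $text ): int {
		$count = 0;
		$end   = strlen( $text );

		for ( $i = 0; $i < $end; $i++ ) {
			if ( 0x80 !== ( ord( $text[ $i ] ) & 0xC0 ) ) {
				++$count;
			}
		}

		return $count;
	}
}

// Callers never check for the extension themselves.
$length = sketch_utf8_codepoint_count( "h\xC3\xA9llo" );
```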

 == Related tickets

  - #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
  - #62172: Deprecate non-UTF-8 support.
  - #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
 dependencies.
    - #29717: Optimize and fix `wp_check_invalid_utf8()`.
    - #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
  - #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
 discussion of requiring `mbstring`.

 == Related PRs

  - [https://github.com/WordPress/wordpress-develop/pull/6883 #6883]
 introduce custom UTF-8 decoding pipeline. (this PR was exploratory as part
 of background research).
  - [https://github.com/WordPress/wordpress-develop/pull/9498 #9498] update
 `wp_check_invalid_utf8()` (currently contains broader updates which will
 be removed and transferred into a new PR).

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63863#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform