[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9
WordPress Trac
noreply at wordpress.org
Sat Aug 23 00:34:26 UTC 2025
#63863: Standardize UTF-8 handling and fallbacks in 6.9
-------------------------+---------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: 6.9
Component: Formatting | Version: trunk
Severity: normal | Resolution:
Keywords: | Focuses:
-------------------------+---------------------
Description changed by dmsnell:
Old description:
> Core uses a wide variety of methods for working with UTF-8, especially so
> when the running server and process doesn’t have the `mbstring` extension
> loaded. Much of the diversity in implementation and specification in this
> code is the legacy of the time and context in which it was added. Today,
> it’s viable to unify and standardize code handling UTF-8, which is what
> this ticket proposes.
>
> **WordPress should unify handling of UTF-8 data** and depend foremost on
> the `mbstring`-provided functions which have received significant updates
> during the 8.x releases of PHP. When unavailable, WordPress should defer
> to a single and reasonable implementation of the functionality in pure
> PHP.
>
> === Why `mbstring`?
>
> The `mbstring` PHP extension has been
> [https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions highly recommended]
> for a long time now and it provides high-performance and spec-compliant
> tools for working with UTF-8 and other
> text encodings. The extension has seen active development and integration
> with PHP throughout the PHP 8.x cycle, including improved Unicode support
> and a couple of notable enhancements for this work:
>
> - Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
> [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 substitution of maximal subparts].
> This is an important aspect for UTF-8 decoders which ensures safe decoding
> of untrusted inputs //and// standardized interoperability with UTF-8
> handling in other systems.
> - Since PHP 8.3.0 the UTF-8 validity of a string
> [https://github.com/php/php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the ZSTR value itself]
> and repeated checks are property-accesses for strings validated as UTF-8.
> - These functions use high-performance implementations in C including
> the use of SIMD and vectorized algorithms whose speed cannot be matched
> in PHP and which aren’t deployed in every other potential method for
> working with UTF-8 (for example, in the `iconv` and `PCRE` libraries).
>
> The case for `mbstring` is strong. Conversely, I believe that there are
> reasons to //avoid// other methods:
> - `iconv()`’s behavior depends on the version of `iconv()` compiled in
> to the running PHP version and it’s hard to establish uniform behavior in
> the PHP code.
> - `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
> only supports an `//IGNORE` mode which removes byte sequences and this is
> not a sound security practice. It’s far more dangerous than substituting
> the sequences as is done with `mb_scrub()`.
> - PCRE-based approaches are also highly system-dependent.
> - Older versions of the PCRE libraries accepted certain invalid UTF-8
> sequences as valid, e.g. with five-byte sequences.
> - PCRE-based approaches are memory heavy, most operating through one of
> two approaches: split the string into an array of chunks and then process
> the chunks; or use `preg_replace_callback()` and pass around sub-
> allocations of the input string.
> - PCRE functions give no method for zero-allocation operation. This
> makes PCRE-based approaches vulnerable to denial-of-service attacks and
> accidents because they must match and extract a full group to operate on
> it. (For example, HTML numeric character references may contain an
> arbitrary number of leading zeros and can choke `preg_match()`, doubling
> memory use or more just to make the match.)
>
> Finally, the plethora of approaches in Core can be dizzying when trying
> to understand behaviors and bugs. Without a well-defined specification of
> the behaviors, things only //seem// to work because the code does not
> explain the bounds of operation.
>
> === Why a single fallback in pure PHP?
>
> While a pure user-space PHP implementation of UTF-8 handling will never
> be able to approach the performance and efficiency of underlying native
> libraries, it can be reasonably fast and //fast enough// for reasonable
> use. Reportedly, around 0.5% of WordPress installations lack the
> `mbstring` extension. For those sites which lack the extension, WordPress
> can make a tradeoff between performance and harmony of behaviors.
>
> When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
> `PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII)
> then it breaks apart WordPress’ reliability and makes for very difficult
> debugging and resolution.
>
> A pure PHP fallback gives WordPress the ability to identify and fix bugs
> in its implementation and the implementation details are visible for all
> to inspect. There’s a much higher barrier to try and diagnose why the
> functions return the wrong result when one needs to scan PHP’s source
> code, `iconv`’s source code, one of several PCRE libraries’ source code,
> and more.
>
> ----
>
> While the up-front effort is high, the HTML API has demonstrated how
> valuable it can be to have a reliable API in WordPress for interoperating
> with various web standards. It’s time to modernize WordPress to support
> UTF-8 universally and remove the existing complexity of ad-hoc handling
> and runtime dependencies.
>
> == Proposal
>
> - Create a new `wp-includes/compat-utf8.php` polyfilling basic UTF-8
> handling and implementing the current `mb_` polyfills from `compat.php`.
> Moving this to a UTF-8-specific module keeps the code in that module
> focused and makes it easier to exclude the WPCS rule for rejecting `goto`
> statements. `goto` is a valuable construct when handling decoding errors
> in a low-level decoder. This module loads //before//
> `wp-includes/compat.php` which makes it simpler in that file to polyfill
> things like `mb_substr()`.
> - Create `wp-includes/utf8.php` containing WordPress-specific functions
> for handling UTF-8. This abstracts access to text behind a unifying
> interface and allows WordPress to improve support, performance, and
> reliability while lifting up all calling code. UTF-8 is universal enough
> to warrant its own subsystem.
> - String functions are conditionally-defined based on the presence of
> the `mbstring` extension and any other relevant factors. This moves
> support-checks to PHP initialization instead of on every invocation of
> these functions.
> - A new UTF-8 decoding pipeline provides zero-allocation, streaming, and
> re-entrant access to a string so that common operations don’t need to
> involve any more overhead than they require. In addition to being a
> versatile fallback mechanism, this low-level scanner can provide access
> to new abilities not available today such as: //count code points within
> a substring without allocating//, //split a string into chunks of valid
> and invalid byte sequences//, and //combine identification, validation,
> and transformation of a string into a single pass//.
> - Replace existing non-canonical UTF-8 code in Core with the new
> abstractions. No more `static $utf8_pcre` checks, no more `if (
> function_exists( 'mb_substr' ) )` — just unconditional explicit
> semantics.
> - Remove the single regex from the HTML API.
>
> As WordPress builds its own abstraction and polyfills for the `mbstring`
> library it can remove the fallback behaviors as it changes its minimum
> supported versions for PHP and if it starts requiring `mbstring`.
>
> == Related tickets
>
> - #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
> - #62172: Deprecate non-UTF-8 support.
> - #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
> dependencies.
> - #29717: Optimize and fix `wp_check_invalid_utf8()`.
> - #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
> - #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
> discussion of requiring `mbstring`.
>
> == Related PRs
>
> - [https://github.com/WordPress/wordpress-develop/pull/6883 #6883]
> introduce custom UTF-8 decoding pipeline. (this PR was exploratory as
> part of background research).
> - [https://github.com/WordPress/wordpress-develop/pull/9498 #9498]
> update `wp_check_invalid_utf8()` (currently contains broader updates
> which will be removed and transferred into a new PR).
New description:
Core uses a wide variety of methods for working with UTF-8, especially so
when the running server and process doesn’t have the `mbstring` extension
loaded. Much of the diversity in implementation and specification in this
code is the legacy of the time and context in which it was added. Today,
it’s viable to unify and standardize code handling UTF-8, which is what
this ticket proposes.

**WordPress should unify handling of UTF-8 data** and depend foremost on
the `mbstring`-provided functions which have received significant updates
during the 8.x releases of PHP. When unavailable, WordPress should defer
to a single and reasonable implementation of the functionality in pure
PHP.

=== Why `mbstring`?

The `mbstring` PHP extension has been
[https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions highly recommended]
for a long time now and it provides high-performance and spec-compliant
tools for working with UTF-8 and other text
encodings. The extension has seen active development and integration with
PHP throughout the PHP 8.x cycle, including improved Unicode support and a
couple of notable enhancements for this work:

- Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
[https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 substitution of maximal subparts].
This is an important aspect for UTF-8 decoders which ensures safe decoding of
untrusted inputs //and// standardized interoperability with UTF-8 handling
in other systems.
- Since PHP 8.3.0 the UTF-8 validity of a string
[https://github.com/php/php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the ZSTR value itself]
and repeated checks are property-accesses for strings validated as UTF-8.
- These functions use high-performance implementations in C including the
use of SIMD and vectorized algorithms whose speed cannot be matched in PHP
and which aren’t deployed in every other potential method for working with
UTF-8 (for example, in the `iconv` and `PCRE` libraries).
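
As a rough illustration of the two `mbstring` calls this section leans on, the sketch below validates a string with `mb_check_encoding()` and repairs it with `mb_scrub()`. The name `example_scrub_utf8()` is hypothetical and not part of Core. The sketch assumes the `mbstring` extension is loaded, and it temporarily sets the substitute character to U+FFFD because the configured substitute is typically `?` unless changed.

```php
<?php
/**
 * Hypothetical helper, not part of Core: returns a valid UTF-8 copy of the input.
 * Assumes the mbstring extension is loaded.
 */
function example_scrub_utf8( string $text ): string {
	// Valid strings pass through untouched.
	if ( mb_check_encoding( $text, 'UTF-8' ) ) {
		return $text;
	}

	// Substitute invalid sequences with U+FFFD, then restore the previous setting.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous );

	return $scrubbed;
}
```

Calling `example_scrub_utf8( "caf\xE9" )` returns a string that passes `mb_check_encoding()`, with the stray Latin-1 byte replaced rather than silently dropped.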

The case for `mbstring` is strong. Conversely, I believe that there are
reasons to //avoid// other methods:
- `iconv()`’s behavior depends on the version of `iconv()` compiled in to
the running PHP version and it’s hard to establish uniform behavior in the
PHP code.
- `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
only supports an `//IGNORE` mode which removes byte sequences and this is
not a sound security practice. It’s far more dangerous than substituting
the sequences as is done with `mb_scrub()`.
- PCRE-based approaches are also highly system-dependent.
- Older versions of the PCRE libraries accepted certain invalid UTF-8
sequences as valid, e.g. with five-byte sequences.
- PCRE-based approaches are memory heavy, most operating through one of
two approaches: split the string into an array of chunks and then process
the chunks; or use `preg_replace_callback()` and pass around sub-
allocations of the input string.
- PCRE functions give no method for zero-allocation operation. This makes
PCRE-based approaches vulnerable to denial-of-service attacks and
accidents because they must match and extract a full group to operate on it.
(For example, HTML numeric character references may contain an arbitrary
number of leading zeros and can choke `preg_match()`, doubling memory use
or more just to make the match.)
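
By contrast with the `preg_match()` case in the last bullet, byte-offset scanning with `strspn()` can step over the leading zeros of a numeric character reference without copying them into a match array. This is only a sketch under stated assumptions: `example_decimal_reference_digits()` is a hypothetical name, it handles only the decimal `&#...` form, it does not check for the trailing semicolon, and it is not the HTML API's implementation.

```php
<?php
/**
 * Hypothetical helper: locates the significant digits of a decimal numeric
 * character reference such as "&#0000065;" starting at byte offset $at.
 *
 * Returns array( offset of first significant digit, digit count ), or null
 * when no decimal reference starts at $at. A digit count of zero means the
 * reference contained only zeros.
 */
function example_decimal_reference_digits( string $html, int $at ): ?array {
	if ( $at < 0 || $at + 2 > strlen( $html ) || '&' !== $html[ $at ] || '#' !== $html[ $at + 1 ] ) {
		return null;
	}

	$digits_at = $at + 2;

	// strspn() returns only a count; no substring is created for the zeros.
	$zero_count  = strspn( $html, '0', $digits_at );
	$first_digit = $digits_at + $zero_count;
	$digit_count = strspn( $html, '0123456789', $first_digit );

	// A reference needs at least one digit, zero or otherwise.
	if ( 0 === $zero_count + $digit_count ) {
		return null;
	}

	return array( $first_digit, $digit_count );
}
```

Only the significant digits ever need to be extracted, so the cost of decoding stays bounded by the length of the number rather than the length of the padding.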

Finally, the plethora of approaches in Core can be dizzying when trying to
understand behaviors and bugs. Without a well-defined specification of the
behaviors, things only //seem// to work because the code does not explain
the bounds of operation.

=== Why a single fallback in pure PHP?

While a pure user-space PHP implementation of UTF-8 handling will never be
able to approach the performance and efficiency of underlying native
libraries, it can be reasonably fast and //fast enough// for reasonable
use. Reportedly, around 0.5% of WordPress installations lack the
`mbstring` extension. For those sites which lack the extension, WordPress
can make a tradeoff between performance and harmony of behaviors.

When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
`PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII)
then it breaks apart WordPress’ reliability and makes for very difficult
debugging and resolution.

A pure PHP fallback gives WordPress the ability to identify and fix bugs
in its implementation and the implementation details are visible for all
to inspect. There’s a much higher barrier to try and diagnose why the
functions return the wrong result when one needs to scan PHP’s source
code, `iconv`’s source code, one of several PCRE libraries’ source code,
and more.

----

While the up-front effort is high, the HTML API has demonstrated how
valuable it can be to have a reliable API in WordPress for interoperating
with various web standards. It’s time to modernize WordPress to support
UTF-8 universally and remove the existing complexity of ad-hoc handling
and runtime dependencies.

== Proposal

- Create a new `wp-includes/compat-utf8.php` polyfilling basic UTF-8
handling and implementing the current `mb_` polyfills from `compat.php`.
Moving this to a UTF-8-specific module keeps the code in that module
focused and makes it easier to exclude the WPCS rule for rejecting `goto`
statements. `goto` is a valuable construct when handling decoding errors
in a low-level decoder. This module loads //before//
`wp-includes/compat.php` which makes it simpler in that file to polyfill
things like `mb_substr()`.
- Create `wp-includes/utf8.php` containing WordPress-specific functions
for handling UTF-8. This abstracts access to text behind a unifying
interface and allows WordPress to improve support, performance, and
reliability while lifting up all calling code. UTF-8 is universal enough
to warrant its own subsystem.
- String functions are conditionally-defined based on the presence of the
`mbstring` extension and any other relevant factors. This moves support-
checks to PHP initialization instead of on every invocation of these
functions. A side-effect of splitting these functions based on the
presence of the extension is the safe removal of
`mbstring_binary_safe_encoding()`. When `mbstring` is loaded, functions
will call the `mb_` functions directly; when it’s not available, there can
be no `mbstring.func_overload`.
- A new UTF-8 decoding pipeline provides zero-allocation, streaming, and
re-entrant access to a string so that common operations don’t need to
involve any more overhead than they require. In addition to being a
versatile fallback mechanism, this low-level scanner can provide access to
new abilities not available today such as: //count code points within a
substring without allocating//, //split a string into chunks of valid and
invalid byte sequences//, and //combine identification, validation, and
transformation of a string into a single pass//.
- Replace existing non-canonical UTF-8 code in Core with the new
abstractions. No more `static $utf8_pcre` checks, no more `if (
function_exists( 'mb_substr' ) )` — just unconditional explicit semantics.
- Remove the single regex from the HTML API.
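
To make the decoding-pipeline item more concrete, here is a minimal sketch of a pure-PHP scanner that measures a run of well-formed UTF-8 and counts its code points by reading bytes at offsets and returning only integers. It is not the pipeline from the linked PRs: `example_scan_valid_utf8()` is a hypothetical name, and the byte ranges follow the well-formed sequences in Table 3-7 of the Unicode Standard.

```php
<?php
/**
 * Hypothetical sketch: measures the longest run of well-formed UTF-8 that
 * starts at byte offset $at (assumed to be within the string).
 *
 * @return int[] array( byte length of the valid run, code points in that run )
 */
function example_scan_valid_utf8( string $bytes, int $at = 0 ): array {
	$end         = strlen( $bytes );
	$i           = $at;
	$code_points = 0;

	while ( $i < $end ) {
		$lead = ord( $bytes[ $i ] );

		if ( $lead < 0x80 ) {
			// ASCII: one byte, one code point.
			$i++;
			$code_points++;
			continue;
		}

		// Choose the number of continuation bytes and the allowed range of
		// the first one. The narrowed ranges reject overlong forms,
		// surrogates, and values above U+10FFFF.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			list( $extra, $low, $high ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $lead ) {
			list( $extra, $low, $high ) = array( 2, 0xA0, 0xBF );
		} elseif ( 0xED === $lead ) {
			list( $extra, $low, $high ) = array( 2, 0x80, 0x9F );
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			list( $extra, $low, $high ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $lead ) {
			list( $extra, $low, $high ) = array( 3, 0x90, 0xBF );
		} elseif ( 0xF4 === $lead ) {
			list( $extra, $low, $high ) = array( 3, 0x80, 0x8F );
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			list( $extra, $low, $high ) = array( 3, 0x80, 0xBF );
		} else {
			break; // 0x80-0xC1 and 0xF5-0xFF never start a sequence.
		}

		// Stop at a sequence truncated by the end of the string.
		if ( $i + $extra >= $end ) {
			break;
		}

		$second = ord( $bytes[ $i + 1 ] );
		if ( $second < $low || $second > $high ) {
			break;
		}

		// Any remaining bytes must be plain continuation bytes (10xxxxxx).
		for ( $k = 2; $k <= $extra; $k++ ) {
			if ( 0x80 !== ( ord( $bytes[ $i + $k ] ) & 0xC0 ) ) {
				break 2;
			}
		}

		$i += $extra + 1;
		$code_points++;
	}

	return array( $i - $at, $code_points );
}
```

A caller could use the returned byte length to split a string at the first invalid sequence, or compare it with `strlen()` to validate and count in a single pass.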

As WordPress builds its own abstraction and polyfills for the `mbstring`
library it can remove the fallback behaviors as it changes its minimum
supported versions for PHP and if it starts requiring `mbstring`.
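
The conditional-definition idea from the proposal might look roughly like the sketch below, where the extension check runs once at load time and callers never branch. The function name `example_utf8_codepoint_count()` is hypothetical, and the fallback assumes its input is already valid UTF-8.

```php
<?php
// Hypothetical function name. The check runs once, when this file is loaded.
if ( extension_loaded( 'mbstring' ) ) {
	function example_utf8_codepoint_count( string $text ): int {
		return mb_strlen( $text, 'UTF-8' );
	}
} else {
	function example_utf8_codepoint_count( string $text ): int {
		// Pure-PHP fallback: every byte that is not a continuation byte
		// (10xxxxxx) starts a code point. Assumes valid UTF-8 input.
		$count = 0;
		$end   = strlen( $text );
		for ( $i = 0; $i < $end; $i++ ) {
			if ( 0x80 !== ( ord( $text[ $i ] ) & 0xC0 ) ) {
				$count++;
			}
		}
		return $count;
	}
}
```

If `mbstring` ever becomes a requirement, the `else` branch can be deleted without touching any calling code.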

== Related tickets

- #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
- #62172: Deprecate non-UTF-8 support.
- #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
dependencies.
- #29717: Optimize and fix `wp_check_invalid_utf8()`.
- #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
- #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
discussion of requiring `mbstring`.

== Related PRs

- [https://github.com/WordPress/wordpress-develop/pull/6883 #6883]
introduce custom UTF-8 decoding pipeline. (this PR was exploratory as part
of background research).
- [https://github.com/WordPress/wordpress-develop/pull/9498 #9498] update
`wp_check_invalid_utf8()` (currently contains broader updates which will
be removed and transferred into a new PR).
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63863#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform