[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9
WordPress Trac
noreply at wordpress.org
Sat Aug 23 00:07:03 UTC 2025
#63863: Standardize UTF-8 handling and fallbacks in 6.9
-------------------------+--------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: 6.9
Component: Formatting | Version: trunk
Severity: normal | Keywords:
Focuses: |
-------------------------+--------------------
Core uses a wide variety of methods for working with UTF-8, especially
when the running server or process doesn’t have the `mbstring` extension
loaded. Much of the diversity in implementation and specification in this
code is the legacy of the time and context in which it was added. Today,
it’s viable to unify and standardize code handling UTF-8, which is what
this ticket proposes.
**WordPress should unify handling of UTF-8 data** and depend foremost on
the `mbstring`-provided functions which have received significant updates
during the 8.x releases of PHP. When unavailable, WordPress should defer
to a single and reasonable implementation of the functionality in pure
PHP.
=== Why `mbstring`?
The `mbstring` PHP extension has been
[https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions highly recommended]
for a long time now and it provides high-performance and spec-compliant
tools for working with UTF-8 and other text
encodings. The extension has seen active development and integration with
PHP throughout the PHP 8.x cycle, including improved Unicode support and a
couple of notable enhancements for this work:
- Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
[https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 substitution of maximal subparts].
This is an important aspect for UTF-8 decoders which ensures safe decoding
of untrusted inputs //and// standardized interoperability with UTF-8
handling in other systems.
- Since PHP 8.3.0 the UTF-8 validity of a string
[https://github.com/php/php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the ZSTR value itself],
so repeated checks on a string already validated as UTF-8 are simple
property accesses.
- These functions use high-performance implementations in C including the
use of SIMD and vectorized algorithms whose speed cannot be matched in PHP
and which aren’t deployed in every other potential method for working with
UTF-8 (for example, in the `iconv` and `PCRE` libraries).
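The validate-then-scrub flow that these bullets describe might look roughly like the following. This is a minimal sketch, not Core code: the function name `scrub_utf8_sketch()` is hypothetical, and returning `null` when `mbstring` is missing is a placeholder for a real fallback. Note that `mb_scrub()` substitutes whatever `mb_substitute_character()` is set to, which defaults to `?`, so the sketch temporarily switches it to U+FFFD.

```php
<?php
/**
 * Hypothetical helper: returns valid UTF-8, replacing invalid bytes with U+FFFD.
 * Returns null when mbstring is unavailable (a fallback would go there).
 */
function scrub_utf8_sketch( string $bytes ): ?string {
	if ( ! function_exists( 'mb_scrub' ) ) {
		return null;
	}

	// Fast path: already-valid strings are returned untouched.
	if ( mb_check_encoding( $bytes, 'UTF-8' ) ) {
		return $bytes;
	}

	// mb_scrub() uses the global substitute character, "?" by default.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $bytes, 'UTF-8' );
	mb_substitute_character( $previous ); // Restore the previous setting.

	return $scrubbed;
}
```

Setting and restoring the substitute character keeps the change local to this call, at the cost of touching global state briefly.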
The case for `mbstring` is strong. Conversely, I believe that there are
reasons to //avoid// other methods:
- `iconv()`’s behavior depends on the version of the `iconv` library
compiled into the running PHP version, which makes it hard to establish
uniform behavior in the PHP code.
- `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
only supports an `//IGNORE` mode which removes the invalid byte sequences,
and this is not a sound security practice: removal can join the
surrounding bytes into text that was never present in the input. It’s far
more dangerous than substituting the sequences as is done with
`mb_scrub()`.
- PCRE-based approaches are also highly system-dependent.
- Older versions of the PCRE libraries accepted certain invalid UTF-8
sequences as valid, e.g. with five-byte sequences.
- PCRE-based approaches are memory-heavy, most operating through one of
two approaches: split the string into an array of chunks and then process
the chunks; or use `preg_replace_callback()` and pass around
sub-allocations of the input string.
- PCRE functions give no method for zero-allocation operation. This makes
PCRE-based approaches vulnerable to denial-of-service attacks and
accidents because they must match and extract a full group to operate on
it. (For example, HTML numeric character references may contain an
arbitrary number of leading zeros and can choke `preg_match()`, doubling
memory use or more just to make the match.)
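To illustrate the zero-allocation point, here is a rough sketch of scanning a decimal numeric character reference with `strspn()`, which counts matching bytes at an offset without copying them. The function name and return shape are hypothetical, and the rules are deliberately simplified (decimal only, semicolon required), so this is not a full HTML decoder.

```php
<?php
/**
 * Hypothetical scanner for "&#<digits>;" starting at byte offset $at.
 * Returns the code point and the byte length of the reference, or null.
 */
function scan_decimal_reference_sketch( string $html, int $at ): ?array {
	$end = strlen( $html );
	if ( $at < 0 || $at + 2 >= $end || '&' !== $html[ $at ] || '#' !== $html[ $at + 1 ] ) {
		return null;
	}

	$digits_at = $at + 2;

	// Count leading zeros and remaining digits without extracting them.
	$zero_count  = strspn( $html, '0', $digits_at );
	$digit_count = strspn( $html, '0123456789', $digits_at + $zero_count );

	if ( 0 === $zero_count + $digit_count ) {
		return null;
	}

	// U+10FFFF is 1114111: more than seven significant digits is out of range.
	if ( $digit_count > 7 ) {
		return null;
	}

	$semicolon_at = $digits_at + $zero_count + $digit_count;
	if ( $semicolon_at >= $end || ';' !== $html[ $semicolon_at ] ) {
		return null;
	}

	// Only the significant digits (at most seven bytes) are ever copied.
	$code_point = 0 === $digit_count
		? 0
		: (int) substr( $html, $digits_at + $zero_count, $digit_count );

	return array(
		'code_point'  => $code_point,
		'byte_length' => $semicolon_at + 1 - $at,
	);
}
```

However many leading zeros the input carries, the only substring allocated here is the short run of significant digits.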
Finally, the plethora of approaches in Core can be dizzying when trying to
understand behaviors and bugs. Without a well-defined specification of the
behaviors, things only //seem// to work because the code does not explain
the bounds of operation.
=== Why a single fallback in pure PHP?
While a pure user-space PHP implementation of UTF-8 handling will never be
able to approach the performance and efficiency of underlying native
libraries, it can be reasonably fast and //fast enough// for reasonable
use. Reportedly, around 0.5% of WordPress installations lack the
`mbstring` extension. For those sites which lack the extension, WordPress
can make a tradeoff between performance and harmony of behaviors.
When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
`PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII),
behavior varies with whichever option a given server happens to support.
This undermines WordPress’ reliability and makes debugging and resolution
very difficult.
A pure PHP fallback gives WordPress the ability to identify and fix bugs
in its implementation and the implementation details are visible for all
to inspect. The barrier to diagnosing why a function returns the wrong
result is much higher when one needs to scan PHP’s source code, `iconv`’s
source code, one of several PCRE libraries’ source code, and more.
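As a rough illustration of the proposed shape (`mbstring` first, then one pure-PHP path), the sketch below validates UTF-8 byte by byte using the well-formed byte ranges from the Unicode Standard. Both function names are hypothetical and this is not the implementation proposed in any of the related tickets. It only shows that a single fallback can be small and fully inspectable.

```php
<?php
/** Hypothetical pure-PHP validator: true if $bytes is well-formed UTF-8. */
function is_valid_utf8_fallback_sketch( string $bytes ): bool {
	static $ascii = null;
	if ( null === $ascii ) {
		$ascii = implode( '', array_map( 'chr', range( 0, 127 ) ) );
	}

	$at  = 0;
	$end = strlen( $bytes );
	while ( $at < $end ) {
		// Skip whole runs of ASCII in one native call.
		$at += strspn( $bytes, $ascii, $at );
		if ( $at >= $end ) {
			break;
		}

		$b0 = ord( $bytes[ $at ] );
		if ( $b0 >= 0xC2 && $b0 <= 0xDF ) {
			$length = 2;
			$min    = 0x80;
			$max    = 0xBF;
		} elseif ( $b0 >= 0xE0 && $b0 <= 0xEF ) {
			$length = 3;
			$min    = 0xE0 === $b0 ? 0xA0 : 0x80; // Reject overlong forms.
			$max    = 0xED === $b0 ? 0x9F : 0xBF; // Reject surrogates.
		} elseif ( $b0 >= 0xF0 && $b0 <= 0xF4 ) {
			$length = 4;
			$min    = 0xF0 === $b0 ? 0x90 : 0x80; // Reject overlong forms.
			$max    = 0xF4 === $b0 ? 0x8F : 0xBF; // Reject values above U+10FFFF.
		} else {
			return false; // Stray continuation byte or invalid lead byte.
		}

		if ( $at + $length > $end ) {
			return false; // Truncated sequence.
		}

		$b1 = ord( $bytes[ $at + 1 ] );
		if ( $b1 < $min || $b1 > $max ) {
			return false;
		}

		for ( $i = 2; $i < $length; $i++ ) {
			$b = ord( $bytes[ $at + $i ] );
			if ( $b < 0x80 || $b > 0xBF ) {
				return false;
			}
		}

		$at += $length;
	}

	return true;
}

/** Hypothetical dispatcher: prefer mbstring, otherwise use the single fallback. */
function is_valid_utf8_sketch( string $bytes ): bool {
	return function_exists( 'mb_check_encoding' )
		? mb_check_encoding( $bytes, 'UTF-8' )
		: is_valid_utf8_fallback_sketch( $bytes );
}
```

With only two code paths, any disagreement between them can be reproduced and fixed in PHP itself rather than traced through several native libraries.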
----
While the up-front effort is high, the HTML API has demonstrated how
valuable it can be to have a reliable API in WordPress for interoperating
with various web standards. It’s time to modernize WordPress to support
UTF-8 universally and remove the existing complexity of ad-hoc handling
and runtime dependencies.
== Related tickets
- #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
- #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
dependencies.
- #29717: Optimize and fix `wp_check_invalid_utf8()`.
- #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
- #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
discussion of requiring `mbstring`.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63863>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform