[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9

WordPress Trac noreply at wordpress.org
Sat Aug 23 00:07:03 UTC 2025


#63863: Standardize UTF-8 handling and fallbacks in 6.9
-------------------------+--------------------
 Reporter:  dmsnell      |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  6.9
Component:  Formatting   |    Version:  trunk
 Severity:  normal       |   Keywords:
  Focuses:               |
-------------------------+--------------------
 Core uses a wide variety of methods for working with UTF-8, especially
 when the running server or process doesn’t have the `mbstring` extension
 loaded. Much of the diversity in implementation and specification in this
 code is the legacy of the time and context in which it was added. Today,
 it’s viable to unify and standardize code handling UTF-8, which is what
 this ticket proposes.

 **WordPress should unify handling of UTF-8 data** and depend foremost on
 the `mbstring`-provided functions which have received significant updates
 during the 8.x releases of PHP. When unavailable, WordPress should defer
 to a single and reasonable implementation of the functionality in pure
 PHP.

 === Why `mbstring`?

 The `mbstring` PHP extension has been
 [https://make.wordpress.org/hosting/handbook/server-environment/#php-extensions highly recommended]
 for a long time now and it provides high-performance and spec-compliant
 tools for working with UTF-8 and other text
 encodings. The extension has seen active development and integration with
 PHP throughout the PHP 8.x cycle, including improved Unicode support and a
 couple of notable enhancements for this work:

  - Since PHP 8.1.6 the `mb_scrub()` function follows the Unicode
 [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G66453 substitution of maximal subparts].
 This is an
 important aspect for UTF-8 decoders which ensures safe decoding of
 untrusted inputs //and// standardized interoperability with UTF-8 handling
 in other systems.
  - Since PHP 8.3.0 the UTF-8 validity of a string
 [https://github.com/php/php-src/commit/d0d834429f55053e827d9c1667d11efd33924cac is cached on the ZSTR value itself]
 and repeated checks are property-accesses for strings validated as
 UTF-8.
  - These functions use high-performance implementations in C including the
 use of SIMD and vectorized algorithms whose speed cannot be matched in PHP
 and which aren’t deployed in every other potential method for working with
 UTF-8 (for example, in the `iconv` and `PCRE` libraries).
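
 As a minimal sketch of how these functions could be combined, assuming the
 `mbstring` extension is loaded: the helper name `example_scrub_utf8()` is
 hypothetical and not part of Core. It checks validity first and only
 scrubs when needed, temporarily setting the substitute character to
 U+FFFD, since a stock `mbstring` configuration otherwise substitutes `?`.

```php
<?php
// Hypothetical helper for illustration; assumes the mbstring extension is loaded.

/**
 * Returns the input with ill-formed UTF-8 sequences substituted, not removed.
 */
function example_scrub_utf8( string $bytes ): string {
	// Already well-formed: return the input untouched.
	if ( mb_check_encoding( $bytes, 'UTF-8' ) ) {
		return $bytes;
	}

	// Remember the current substitute character so it can be restored.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );

	// How many substitutes appear per invalid span depends on the PHP version.
	$scrubbed = mb_scrub( $bytes, 'UTF-8' );

	mb_substitute_character( $previous );

	return $scrubbed;
}

$clean = example_scrub_utf8( "a\xC0b" ); // 0xC0 is never valid in UTF-8.
```

 The result is always well-formed UTF-8, so later code can rely on that
 without re-checking which extension produced it.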

 The case for `mbstring` is strong. Conversely, I believe that there are
 reasons to //avoid// other methods:
  - `iconv()`’s behavior depends on the version of the `iconv` library
 compiled into the running PHP build, which makes it hard to establish
 uniform behavior in the PHP code.
  - `iconv()` cannot “scrub” invalid byte sequences from UTF-8 texts. It
 only supports an `//IGNORE` mode which removes byte sequences and this is
 not a sound security practice. It’s far more dangerous than substituting
 the sequences as is done with `mb_scrub()`.
  - PCRE-based approaches are also highly system-dependent.
  - Older versions of the PCRE libraries accepted certain invalid UTF-8
 sequences as valid, e.g. with five-byte sequences.
  - PCRE-based approaches are memory heavy, most operating through one of
 two approaches: split the string into an array of chunks and then process
 the chunks; or use `preg_replace_callback()` and pass around sub-
 allocations of the input string.
  - PCRE functions give no method for zero-allocation operation. This makes
 PCRE-based approaches vulnerable to denial-of-service attacks and
 accidents because they must match and extract a full group to operate on
 it. (For example, HTML numeric character references may contain an
 arbitrary number of leading zeros and can choke `preg_match()`, doubling
 memory use or more just to make the match.)
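
 Two of the points above can be illustrated with a toy sketch. The first
 part simulates removal versus substitution with `str_replace()` on a
 single byte that is never valid in UTF-8; it does not call `iconv()` or
 `mb_scrub()`. The second part uses `strspn()` to skip leading zeros in a
 numeric character reference by offset instead of capturing them. The
 function name `example_significant_digits()` is hypothetical.

```php
<?php
// Toy illustration only: 0xC0 can never appear in well-formed UTF-8,
// so it stands in here for "an invalid byte sequence".
$untrusted = "<scr\xC0ipt>alert(1)</scr\xC0ipt>";

// Removal: once the invalid byte is gone, the pieces fuse into a real tag.
$removed = str_replace( "\xC0", '', $untrusted );

// Substitution: U+FFFD keeps the two halves apart, so no tag is formed.
$substituted = str_replace( "\xC0", "\u{FFFD}", $untrusted );

/**
 * Hypothetical helper: returns the significant decimal digits of a numeric
 * character reference, given the byte offset just past its "&#" prefix.
 */
function example_significant_digits( string $html, int $digits_at ): string {
	// Count leading zeros by offset; nothing is copied out of $html here.
	$zeros = strspn( $html, '0', $digits_at );

	// Count the remaining decimal digits, again by offset only.
	$digits = strspn( $html, '0123456789', $digits_at + $zeros );

	// Only the short significant part is extracted as a new string.
	return substr( $html, $digits_at + $zeros, $digits );
}

$reference = '&#' . str_repeat( '0', 1000 ) . '65;';
$digits    = example_significant_digits( $reference, 2 );
```

 Here `$removed` contains a working `<script>` tag while `$substituted`
 does not, and `$digits` holds only `65` regardless of how many zeros
 precede it.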

 Finally, the plethora of approaches in Core can be dizzying when trying to
 understand behaviors and bugs. Without a well-defined specification of the
 behaviors, things only //seem// to work because the code does not explain
 the bounds of operation.

 === Why a single fallback in pure PHP?

 While a pure user-space PHP implementation of UTF-8 handling will never be
 able to approach the performance and efficiency of underlying native
 libraries, it can be reasonably fast and //fast enough// for reasonable
 use. Reportedly, around 0.5% of WordPress installations lack the
 `mbstring` extension. For those sites which lack the extension, WordPress
 can make a tradeoff between performance and harmony of behaviors.

 When functions try a chain of fallback options (e.g. `mbstring`, `iconv`,
 `PCRE` with Unicode support, `PCRE` with byte patterns hard-coded, ASCII),
 it undermines WordPress’ reliability and makes debugging and resolution
 very difficult.

 A pure PHP fallback gives WordPress the ability to identify and fix bugs
 in its implementation and the implementation details are visible for all
 to inspect. The barrier to diagnosing why a function returns the wrong
 result is much higher when one needs to scan PHP’s source
 code, `iconv`’s source code, one of several PCRE libraries’ source code,
 and more.
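
 To make the idea concrete, here is a rough sketch of what "`mbstring`
 first, one pure-PHP fallback otherwise" could look like for validation.
 Both function names are hypothetical and this is not the implementation
 proposed for Core; the fallback follows the table of well-formed UTF-8
 byte sequences in chapter 3 of the Unicode Standard.

```php
<?php
// Hypothetical names for illustration; not part of WordPress Core.

/** Pure-PHP check that every byte sequence is well-formed UTF-8. */
function example_is_valid_utf8_fallback( string $bytes ): bool {
	$length = strlen( $bytes );
	$at     = 0;

	while ( $at < $length ) {
		$b0 = ord( $bytes[ $at ] );

		if ( $b0 <= 0x7F ) { // ASCII
			$at++;
			continue;
		}

		// $need: continuation bytes; $min/$max: allowed range of the first one.
		if ( $b0 >= 0xC2 && $b0 <= 0xDF ) {
			list( $need, $min, $max ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $b0 ) {
			list( $need, $min, $max ) = array( 2, 0xA0, 0xBF ); // No overlongs.
		} elseif ( 0xED === $b0 ) {
			list( $need, $min, $max ) = array( 2, 0x80, 0x9F ); // No surrogates.
		} elseif ( $b0 >= 0xE1 && $b0 <= 0xEF ) {
			list( $need, $min, $max ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $b0 ) {
			list( $need, $min, $max ) = array( 3, 0x90, 0xBF ); // No overlongs.
		} elseif ( $b0 >= 0xF1 && $b0 <= 0xF3 ) {
			list( $need, $min, $max ) = array( 3, 0x80, 0xBF );
		} elseif ( 0xF4 === $b0 ) {
			list( $need, $min, $max ) = array( 3, 0x80, 0x8F ); // Max U+10FFFF.
		} else {
			return false; // 0x80-0xC1 and 0xF5-0xFF never start a sequence.
		}

		if ( $at + $need >= $length ) {
			return false; // Truncated sequence at the end of the input.
		}

		$b1 = ord( $bytes[ $at + 1 ] );
		if ( $b1 < $min || $b1 > $max ) {
			return false;
		}

		for ( $i = 2; $i <= $need; $i++ ) {
			$b = ord( $bytes[ $at + $i ] );
			if ( $b < 0x80 || $b > 0xBF ) {
				return false;
			}
		}

		$at += $need + 1;
	}

	return true;
}

/** Prefers mbstring when loaded; otherwise uses the single fallback above. */
function example_is_valid_utf8( string $bytes ): bool {
	return function_exists( 'mb_check_encoding' )
		? mb_check_encoding( $bytes, 'UTF-8' )
		: example_is_valid_utf8_fallback( $bytes );
}
```

 With only two code paths, a disagreement between them is something that
 can be reproduced and fixed inside WordPress itself.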

 ----

 While the up-front effort is high, the HTML API has demonstrated how
 valuable it can be to have a reliable API in WordPress for interoperating
 with various web standards. It’s time to modernize WordPress to support
 UTF-8 universally and remove the existing complexity of ad-hoc handling
 and runtime dependencies.

 == Related tickets

  - #38044: Deprecate `seems_utf8()` and add `wp_is_valid_utf8()`.
  - #63837: Overhaul `wp_check_invalid_utf8()` to remove runtime
 dependencies.
    - #29717: Optimize and fix `wp_check_invalid_utf8()`.
    - #43224: Remove `$pcre_utf8` logic from `wp_check_invalid_utf8()`.
  - #55603: Address deprecation of `utf8_decode()` and `utf8_encode()`,
 discussion of requiring `mbstring`.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63863>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

