[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()

WordPress Trac noreply at wordpress.org
Fri Aug 22 22:53:48 UTC 2025


#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
 Reporter:  dmsnell                   |       Owner:  (none)
     Type:  enhancement               |      Status:  new
 Priority:  normal                    |   Milestone:  6.9
Component:  Formatting                |     Version:  trunk
 Severity:  normal                    |  Resolution:
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+---------------------
Description changed by dmsnell:

Old description:

> There are a few challenges with `wp_check_invalid_utf8()`
>
>  - Its behavior is dependent on Unicode support in the PCRE functions.
>  - PCRE Unicode support has changed across versions, with older versions
> allowing invalid UTF-8.
>  - It returns `false` if `$strip = true` is requested.
>  - When a system lacks support there’s zero fallback.
>  - It assumes that input strings are encoded with `blog_charset`.
>
> The last point is inherent to how the function works, but the other
> points can be updated by relying on the newer `wp_is_valid_utf8()` and by
> providing a custom fallback method to strip out invalid byte sequences.
>
> Clarifying this function means separating two concerns it currently
> blurs: determining whether content is allegedly UTF-8, and handling it as
> if it is. A new function, `wp_scrub_utf8()`, can be used to produce a
> valid UTF-8 string formed by replacing invalid sequences of bytes with
> the Unicode replacement character U+FFFD (`�`). This behaves the way
> `wp_check_invalid_utf8( $s, true )` //should// work if the
> `blog_charset` is `UTF-8`.
>
> Ideally, data should be transformed //from// some encoding //into//
> `UTF-8` as soon as it enters WordPress and then code which previously
> called `wp_check_invalid_utf8()` can now call `wp_scrub_utf8()` //if// it
> truly needs to remove invalid bytes.

New description:

 There are a few challenges with `wp_check_invalid_utf8()`

  - Its behavior is dependent on Unicode support in the PCRE functions.
  - PCRE Unicode support has changed across versions, with older versions
 allowing invalid UTF-8.
  - It returns `false` if `$strip = true` is requested.
  - When a system lacks support there’s zero fallback.
  - It assumes that input strings are encoded with `blog_charset`.

 The last point is inherent to how the function works, but the other points
 can be updated by relying on the newer `wp_is_valid_utf8()` and by
 providing a custom fallback method to strip out invalid byte sequences.

 Clarifying this function means separating two concerns it currently blurs:
 determining whether content is allegedly UTF-8, and handling it as if it
 is. A new function, `wp_scrub_utf8()`, can be used to produce a valid
 UTF-8 string formed by replacing invalid sequences of bytes with the
 Unicode replacement character U+FFFD (`�`). This behaves the way
 `wp_check_invalid_utf8( $s, true )` //should// work if the `blog_charset`
 is `UTF-8`.

 Ideally, data should be transformed //from// some encoding //into//
 `UTF-8` as soon as it enters WordPress and then code which previously
 called `wp_check_invalid_utf8()` can now call `wp_scrub_utf8()` //if// it
 truly needs to remove invalid bytes.
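
 The replacement behavior described above can be sketched in plain PHP.
 This is only an illustration, assuming the `mbstring` extension is loaded;
 `example_scrub_utf8()` is a hypothetical name and is not the patch's
 implementation of `wp_scrub_utf8()`. The ticket itself proposes a custom
 fallback for systems that lack this kind of support.

```php
<?php
/**
 * Hypothetical sketch: return a valid UTF-8 string by replacing invalid
 * byte sequences with U+FFFD. Assumes the mbstring extension is available.
 */
function example_scrub_utf8( string $bytes ): string {
	// Already-valid input is returned unchanged.
	if ( mb_check_encoding( $bytes, 'UTF-8' ) ) {
		return $bytes;
	}

	// Temporarily make mbstring substitute U+FFFD for invalid bytes.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );

	// Converting UTF-8 to UTF-8 forces invalid sequences to be substituted.
	$scrubbed = mb_convert_encoding( $bytes, 'UTF-8', 'UTF-8' );

	// Restore the caller's substitution setting.
	mb_substitute_character( $previous );

	return $scrubbed;
}

// 0xFF can never appear in UTF-8, so it is replaced; "a" and "b" survive.
$clean = example_scrub_utf8( "a\xFFb" );
```

 How many replacement characters a longer invalid run produces can vary
 between `mbstring` versions, so callers should rely only on the result
 being valid UTF-8, not on its exact length.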

 == Related

  - Resolves #43224

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:9>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

