[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()
WordPress Trac
noreply at wordpress.org
Tue Aug 19 02:23:11 UTC 2025
#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: 6.9
Component: Formatting | Version: trunk
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+---------------------
Description changed by dmsnell:
Old description:
> There are a few challenges with `wp_check_invalid_utf8()`
>
> - Its behavior is dependent on Unicode support in the PCRE functions.
> - PCRE Unicode support has changed across versions, with older versions
> allowing invalid UTF-8.
> - It returns `false` if `$strip = true` is requested.
> - When a system lacks support there’s zero fallback.
> - It assumes that input strings are encoded with `blog_charset`.
>
> The last point is inherent to how the function works, but the other
> points can be updated by relying on the newer `wp_is_valid_utf8()` and by
> providing a custom fallback method to strip out invalid byte sequences.
New description:
There are a few challenges with `wp_check_invalid_utf8()`:
- Its behavior depends on Unicode support in the PCRE functions.
- PCRE Unicode support has changed across versions, with older versions
  allowing invalid UTF-8.
- It returns `false` if `$strip = true` is requested.
- When a system lacks that support there is no fallback.
- It assumes that input strings are encoded with `blog_charset`.
The last point is inherent to how the function works, but the other points
can be addressed by relying on the newer `wp_is_valid_utf8()` and by
providing a custom fallback method to strip out invalid byte sequences.
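As a rough illustration of validation that does not depend on PCRE Unicode support, the sketch below checks well-formedness with a byte-oriented pattern (no `u` flag). The name `example_is_valid_utf8()` is hypothetical and this is not the implementation of `wp_is_valid_utf8()`; it only shows that the check can be written without the `u` modifier.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports whether $bytes is
 * well-formed UTF-8 without using the PCRE `u` flag.
 */
function example_is_valid_utf8( string $bytes ): bool {
	// Each alternative is one well-formed byte sequence. The ranges reject
	// overlong forms, UTF-16 surrogates, and code points above U+10FFFF.
	$pattern = '/\A(?:
		  [\x00-\x7F]
		| [\xC2-\xDF][\x80-\xBF]
		| \xE0[\xA0-\xBF][\x80-\xBF]
		| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
		| \xED[\x80-\x9F][\x80-\xBF]
		| \xF0[\x90-\xBF][\x80-\xBF]{2}
		| [\xF1-\xF3][\x80-\xBF]{3}
		| \xF4[\x80-\x8F][\x80-\xBF]{2}
	)*+\z/x';

	// preg_match() returns false on an internal error (for example when a
	// PCRE limit is hit on a very long input); this sketch treats that as
	// "not valid", which can be a false negative.
	return 1 === preg_match( $pattern, $bytes );
}
```

Because the pattern is matched byte by byte, the result should not vary with how the PCRE library was built. A hand-written byte scanner would avoid PCRE limits on very long strings.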
Improving clarity around this function involves removing the blurred line
between determining if content is allegedly UTF-8 and handling it as if it
is. A new function, `wp_scrub_utf8()`, can be used to produce a valid
UTF-8 string formed by replacing invalid sequences of bytes with the
Unicode replacement character U+FFFD (`�`). This behaves the way
`wp_check_invalid_utf8( $s, true )` //should// work when the `blog_charset`
is `UTF-8`.
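To make the replacement behavior concrete, here is a minimal sketch of a scrubbing routine in plain PHP. The names `example_scrub_utf8()` and `example_scan_utf8()` are hypothetical and this is not the code proposed for `wp_scrub_utf8()`. It assumes one U+FFFD per ill-formed lead byte together with any valid continuation bytes that followed it (the "maximal subpart" practice described in the Unicode Standard); whether the real function groups bytes the same way is an assumption.

```php
<?php
/**
 * Hypothetical helper: inspects the sequence starting at offset $i and
 * returns [ is_valid, byte_count ]. For an invalid sequence, byte_count is
 * the number of bytes to replace with a single U+FFFD.
 */
function example_scan_utf8( string $b, int $i, int $len ): array {
	$c = ord( $b[ $i ] );
	if ( $c <= 0x7F ) {
		return array( true, 1 );
	}

	// $need continuation bytes; $lo..$hi is the allowed range of the first
	// continuation byte, which excludes overlongs and surrogates.
	if ( $c >= 0xC2 && $c <= 0xDF ) {
		list( $need, $lo, $hi ) = array( 1, 0x80, 0xBF );
	} elseif ( 0xE0 === $c ) {
		list( $need, $lo, $hi ) = array( 2, 0xA0, 0xBF );
	} elseif ( 0xED === $c ) {
		list( $need, $lo, $hi ) = array( 2, 0x80, 0x9F );
	} elseif ( $c >= 0xE1 && $c <= 0xEF ) {
		list( $need, $lo, $hi ) = array( 2, 0x80, 0xBF );
	} elseif ( 0xF0 === $c ) {
		list( $need, $lo, $hi ) = array( 3, 0x90, 0xBF );
	} elseif ( 0xF4 === $c ) {
		list( $need, $lo, $hi ) = array( 3, 0x80, 0x8F );
	} elseif ( $c >= 0xF1 && $c <= 0xF3 ) {
		list( $need, $lo, $hi ) = array( 3, 0x80, 0xBF );
	} else {
		// Lone continuation byte or a byte that never appears in UTF-8.
		return array( false, 1 );
	}

	for ( $k = 1; $k <= $need; $k++ ) {
		if ( $i + $k >= $len ) {
			return array( false, $k ); // Input ended mid-sequence.
		}
		$t = ord( $b[ $i + $k ] );
		if ( $t < $lo || $t > $hi ) {
			return array( false, $k ); // Next byte cannot continue the sequence.
		}
		// Later continuation bytes use the full 0x80..0xBF range.
		$lo = 0x80;
		$hi = 0xBF;
	}

	return array( true, $need + 1 );
}

/**
 * Hypothetical sketch: returns a well-formed UTF-8 string in which invalid
 * byte sequences are replaced by U+FFFD.
 */
function example_scrub_utf8( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );
	$i   = 0;
	while ( $i < $len ) {
		list( $valid, $n ) = example_scan_utf8( $bytes, $i, $len );
		$out .= $valid ? substr( $bytes, $i, $n ) : "\u{FFFD}";
		$i   += $n;
	}
	return $out;
}
```

Valid input passes through unchanged, so a caller could apply a routine like this unconditionally. Unlike stripping, replacement keeps a visible marker where bytes were lost.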
Ideally, data should be transformed //from// some encoding //into//
`UTF-8` as soon as it enters WordPress. Code that previously called
`wp_check_invalid_utf8()` can then call `wp_scrub_utf8()` //if// it
truly needs to remove invalid bytes.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform