[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()
WordPress Trac
noreply at wordpress.org
Wed Aug 20 09:01:21 UTC 2025
#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: 6.9
Component: Formatting | Version: trunk
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+---------------------
Comment (by jonsurrell):
> It returns false if $strip = true is requested.
**The strip behavior is very interesting and the proposed change in
behavior is important.** It's tempting to rename the parameter to
something like `$replace` or `$substitute`, but a rename could break
callers that pass it as a PHP 8 named argument.
-----
[https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630 The Unicode Standard is clear on the practice of
U+FFFD substitution] and [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19653 notes that ignoring "bad" bytes represents a
security risk]:
> If a noncharacter is received in open interchange, an application is not
required to interpret it in any way. It is good practice, however, to
recognize it as a noncharacter and to take appropriate action, such as
replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in
the text. It is not recommended to simply delete noncharacter code points
from such text, because of the potential security issues caused by
deleting uninterpreted characters.
[https://www.unicode.org/reports/tr36/tr36-15.html#Substituting_for_Ill_Formed_Subsequences
The mentioned technical report] gives some examples, but also clearly
states:
> If characters are to be substituted for ill-formed subsequences, it is
important that those characters be relatively safe.
>
> - Deletion (substituting the empty string) can be quite nasty, because
it joins characters that would have been separate…
> - Substituting characters that are valid syntax for constructs such as
file names has similar problems. For example, the '.' can be very
problematic.
> - U+FFFD is usually unproblematic, because it is designed expressly
for this kind of purpose. That is, because it does not have syntactic
meaning in programming languages or structured data, it will typically
just cause a failure in parsing.
> - Where U+FFFD is not available, a common alternative is "?". While
this character may occur syntactically, it appears to be less subject to
attack than most others.
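To make the deletion risk concrete, here is a minimal sketch. It is not taken from the patch: it borrows the `ENT_IGNORE` and `ENT_SUBSTITUTE` flags of PHP's `htmlspecialchars()` as stand-ins for "delete" and "substitute U+FFFD", and the path string is a made-up input.
{{{#!php
<?php
$input = ".\xC0./secret"; // A stray C0 byte hides the ".." sequence.

// A naive traversal check that runs before cleanup sees nothing wrong.
$blocked = false !== strpos( $input, '..' );

// ENT_IGNORE silently drops the ill-formed byte: the dots become adjacent.
$deleted = htmlspecialchars( $input, ENT_IGNORE, 'UTF-8' );

// ENT_SUBSTITUTE emits U+FFFD in its place: the dots stay apart.
$substituted = htmlspecialchars( $input, ENT_SUBSTITUTE, 'UTF-8' );

var_dump( $blocked, $deleted, $substituted );
}}}
The deleted form contains a `..` that the earlier check never saw, while the substituted form does not. That is the "joins characters that would have been separate" problem from the report.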
-----
I was reviewing the proposed changes to understand the difference and
discovered this myself. The existing strip implementation relies on
`iconv` in this way:
{{{#!php
<?php
var_dump(
    iconv( 'utf-8', 'utf-8', ".\xC0." ),             // C0 is never valid.
    iconv( 'utf-8', 'utf-8', ".\xE2\x8C." ),         // Missing A3 at end.
    iconv( 'utf-8', 'utf-8', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
    iconv( 'utf-8', 'utf-8', ".\xC1\xBF." ),         // Overlong sequence.
    iconv( 'utf-8', 'utf-8', ".\xED\xA0\x80." )      // Surrogate half.
);
}}}
Each of those generates a notice and returns `false` because the
conversion fails. [https://www.php.net/manual/en/function.iconv.php I
reviewed iconv documentation] to confirm this behavior and noticed the
following:
> If the string //IGNORE is appended, characters that cannot be
represented in the target charset are silently discarded.
That's likely what was intended in the original implementation, but is
certainly not the documented behavior of this function. Each of these
returns the string `".."` with no warning:
{{{#!php
<?php
var_dump(
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xC0." ),             // C0 is never valid.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C." ),         // Missing A3 at end.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xC1\xBF." ),         // Overlong sequence.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xED\xA0\x80." )      // Surrogate half.
);
}}}
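For comparison, substituting U+FFFD instead of stripping could be sketched with the mbstring extension. This is only an illustration under assumptions, not the approach taken in the patch: `example_replace_invalid_utf8()` is a hypothetical name, and the number of U+FFFD emitted for longer ill-formed sequences may differ between PHP versions.
{{{#!php
<?php
// Hypothetical helper, not part of WordPress or of the proposed patch.
function example_replace_invalid_utf8( string $bytes ): string {
    $previous = mb_substitute_character();  // Remember the current setting.
    mb_substitute_character( 0xFFFD );      // Use U+FFFD for ill-formed input.
    $scrubbed = mb_scrub( $bytes, 'UTF-8' );
    mb_substitute_character( $previous );   // Restore the previous setting.
    return $scrubbed;
}

$scrubbed = example_replace_invalid_utf8( ".\xC0." );
var_dump( $scrubbed === ".\u{FFFD}." );
}}}
Unlike the `//IGNORE` variant, the two dots stay separated by a visible marker.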
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform