[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()

Wed Aug 20 09:01:21 UTC 2025

#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
 Reporter:  dmsnell                   |       Owner:  (none)
     Type:  enhancement               |      Status:  new
 Priority:  normal                    |   Milestone:  6.9
Component:  Formatting                |     Version:  trunk
 Severity:  normal                    |  Resolution:
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+---------------------

Comment (by jonsurrell):

 > It returns false if $strip = true is requested.

 **The strip behavior is very interesting and the proposed change in
 behavior is important.** It's tempting to rename the parameter to
 something like `$replace` or `$substitute`, but that would be a breaking
 change for any caller that passes `$strip` as a PHP 8 named argument.

 -----

 [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G40630
 The Unicode Standard is clear on the practice of U+FFFD substitution] and
 [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-23/#G19653
 notes that ignoring "bad" bytes represents a security risk]:

 > If a noncharacter is received in open interchange, an application is not
 required to interpret it in any way. It is good practice, however, to
 recognize it as a noncharacter and to take appropriate action, such as
 replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in
 the text. It is not recommended to simply delete noncharacter code points
 from such text, because of the potential security issues caused by
 deleting uninterpreted characters.

 [https://www.unicode.org/reports/tr36/tr36-15.html#Substituting_for_Ill_Formed_Subsequences
 The mentioned technical report] gives some examples, but also clearly
 states:

 > If characters are to be substituted for ill-formed subsequences, it is
 important that those characters be relatively safe.
 >
 > - Deletion (substituting the empty string) can be quite nasty, because
 it joins characters that would have been separate…
 > - Substituting characters that are valid syntax for constructs such as
 file names has similar problems. For example, the '.' can be very
 problematic.
 >   - U+FFFD is usually unproblematic, because it is designed expressly
 for this kind of purpose. That is, because it does not have syntactic
 meaning in programming languages or structured data, it will typically
 just cause a failure in parsing.
 >   - Where U+FFFD is not available, a common alternative is "?". While
 this character may occur syntactically, it appears to be less subject to
 attack than most others.
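
 The "joins characters that would have been separate" point can be
 illustrated with a minimal sketch. The input string is a made-up example,
 and `str_replace()` on the single byte `C0` only stands in for a real
 scrubber here (it is safe in this one case because `C0` never occurs in
 well-formed UTF-8); it is not how the patch works.

 {{{#!php
 <?php
 // Hypothetical input: an invalid byte (C0) splits the word "script".
 $input = "<scr\xC0ipt>";

 // Deleting the invalid byte joins its neighbours into a meaningful tag.
 $deleted = str_replace( "\xC0", '', $input );

 // Substituting U+FFFD keeps the neighbours apart.
 $replaced = str_replace( "\xC0", "\u{FFFD}", $input );

 var_dump( $deleted, $replaced );
 }}}

 After deletion the string is `<script>`, while the substituted string
 still contains U+FFFD between `scr` and `ipt` and so does not form the tag.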

 -----

 I was reviewing the proposed changes to understand the difference and
 discovered this myself. The existing strip implementation relies on
 `iconv` in this way:

 {{{#!php
 <?php
 var_dump(
     iconv( 'utf-8', 'utf-8', ".\xC0." ),             // C0 is never valid.
     iconv( 'utf-8', 'utf-8', ".\xE2\x8C." ),         // Missing A3 at end.
     iconv( 'utf-8', 'utf-8', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
     iconv( 'utf-8', 'utf-8', ".\xC1\xBF." ),         // Overlong sequence.
     iconv( 'utf-8', 'utf-8', ".\xED\xA0\x80." )      // Surrogate half.
 );
 }}}

 Each of those generates a notice and returns `false` because the
 conversion fails. [https://www.php.net/manual/en/function.iconv.php I
 reviewed iconv documentation] to confirm this behavior and noticed the
 following:

 > If the string //IGNORE is appended, characters that cannot be
 represented in the target charset are silently discarded.

 That's likely what was intended in the original implementation, but is
 certainly not the documented behavior of this function. Each of these
 returns the string `".."` with no warning:

 {{{#!php
 <?php
 var_dump(
     iconv( 'utf-8', 'utf-8//IGNORE', ".\xC0." ),             // C0 is never valid.
     iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C." ),         // Missing A3 at end.
     iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
     iconv( 'utf-8', 'utf-8//IGNORE', ".\xC1\xBF." ),         // Overlong sequence.
     iconv( 'utf-8', 'utf-8//IGNORE', ".\xED\xA0\x80." )      // Surrogate half.
 );
 }}}
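
 To cross-check that the five inputs above are ill-formed without relying
 on `iconv`, here is a minimal sketch that assumes PCRE's `u` modifier as
 the validator. `is_valid_utf8()` is a hypothetical helper name used only
 for illustration; it is not part of the patch.

 {{{#!php
 <?php
 // Hypothetical helper, not part of the patch. With the `u` modifier PCRE
 // validates the subject as UTF-8, and preg_match() returns false rather
 // than 1 when the subject is ill-formed.
 function is_valid_utf8( string $text ): bool {
     return 1 === preg_match( '//u', $text );
 }

 $inputs = array(
     ".\xC0.",             // C0 is never valid.
     ".\xE2\x8C.",         // Missing A3 at end.
     ".\xE2\x8C\xE2\x8C.", // Two truncated sequences.
     ".\xC1\xBF.",         // Overlong sequence.
     ".\xED\xA0\x80.",     // Surrogate half.
 );

 foreach ( $inputs as $input ) {
     var_dump( is_valid_utf8( $input ) );
 }
 }}}

 A check like this only reports whether the input is well-formed; it says
 nothing about how the invalid bytes should be handled afterwards.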

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform