[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()

WordPress Trac noreply at wordpress.org
Fri Aug 22 22:40:39 UTC 2025


#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
 Reporter:  dmsnell                   |       Owner:  (none)
     Type:  enhancement               |      Status:  new
 Priority:  normal                    |   Milestone:  6.9
Component:  Formatting                |     Version:  trunk
 Severity:  normal                    |  Resolution:
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+---------------------

Comment (by dmsnell):

 It’s rare for me to go on a hike and not come back with interesting ideas
 about how to proceed on these long problems. To that end, I am sorry to
 report that I made this issue much more expansive, bringing back ideas
 from [https://github.com/WordPress/wordpress-develop/pull/6883 WordPress/wordpress-develop#6883].
 I probably need to create a new ticket for this,
 but all of the work revolves around one remaining issue that was bothering
 me about the proposed PR: //can we avoid maintaining two or more UTF-8
 decoders?//

 I would like to split off the work I have done to stage a broader update
 for 6.9.0.

 ----

 @siliconforks the code in the proposed PR and all updated versions of it
 account for that change. Thanks again!

 ----

 > The strip behavior is very interesting and the proposed change in
 > behavior is important.

 I’m not sure about this, @jonsurrell, and I’m torn myself. It’s not a
 proposed change in behavior, just a bug fix for a defect that has been in
 the code for
 [https://github.com/WordPress/WordPress-develop/commit/ec804d2905a4707fd3920b111e43591a94c9cde4 16 years].
 I did look for tickets reporting this but couldn’t find any. I suspect
 that the bug has been noticed but has been hard to report because it’s
 not obvious where it appears. The failing behavior is to return an empty
 string, which is the same behavior as when a string is deemed “unsafe” by
 other means.

 What this means is that this specific bug is masked by other intended
 rejections where the function is called. Fixing this bug might actually
 solve a number of unreported issues. The function docs clearly indicate
 that invalid bytes are to be “stripped” but don’t indicate how.
 Regardless, there’s a clear statement that //only// the invalid bytes are
 to be stripped, not the entire string.

 Given that we should avoid //removing// invalid bytes, I propose
 substituting.

 > That's likely what was intended in the original implementation, but is
 > certainly not the documented behavior of this function

 There’s also `//TRANSLIT`, but these flags are all a bit blurry to
 understand from the PHP perspective; that is, WordPress doesn’t have
 control over them. We can guess that it was supposed to be `//IGNORE`, and
 I read the documented behavior to be exactly that: “strip” implies
 removing characters, which is what `//IGNORE` does.

 That being said, I feel like “strip” is broad enough to incorporate the
 substitution. It’s not a stretch to me to say that putting in the � in
 place of the invalid bytes is tantamount to stripping them out. Something
 is still there, but they aren’t.
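
 To make the three behaviors discussed above concrete, here is a rough
 sketch in Python rather than PHP. It is not the WordPress implementation
 and not the code in the PR. The function names are made up for
 illustration, and Python's built-in decoder error handlers stand in for
 whatever the real function does internally: `errors="ignore"` plays the
 role of an `//IGNORE`-style strip, and `errors="replace"` plays the role of
 substituting �.

```python
# Illustrative stand-ins only; these names are hypothetical, not WordPress APIs.

def reject_if_invalid(data: bytes) -> str:
    """Return an empty string as soon as any byte sequence is invalid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return ""


def strip_invalid(data: bytes) -> str:
    """Drop only the invalid bytes and keep everything else."""
    return data.decode("utf-8", errors="ignore")


def substitute_invalid(data: bytes) -> str:
    """Put U+FFFD (the replacement character) where invalid bytes were."""
    return data.decode("utf-8", errors="replace")


# 0xFF can never appear in well-formed UTF-8, so this input has one bad byte.
sample = b"caf\xc3\xa9 \xff latte"

print(repr(reject_if_invalid(sample)))
print(repr(strip_invalid(sample)))
print(repr(substitute_invalid(sample)))
```

 With this input the first call loses the whole string, the second keeps
 the text but leaves no trace that a byte was dropped, and the third keeps
 the text with a visible marker where the invalid byte was.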

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:8>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

