[wp-trac] [WordPress Trac] #63837: Update wp_check_invalid_utf8()
WordPress Trac
noreply at wordpress.org
Fri Aug 22 22:40:39 UTC 2025
#63837: Update wp_check_invalid_utf8()
--------------------------------------+---------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: 6.9
Component: Formatting | Version: trunk
Severity: normal | Resolution:
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+---------------------
Comment (by dmsnell):
It’s rare for me to go on a hike and not come back with interesting ideas
about how to proceed on these long problems. To that end, I am sorry to
report that I made this issue much more expansive, bringing back ideas
from [https://github.com/WordPress/wordpress-develop/pull/6883
WordPress/wordpress-develop#6883]. I probably need to create a new ticket for this,
but all of the work revolves around one remaining issue that was bothering
me about the proposed PR: //can we avoid maintaining two or more UTF-8
decoders?//
I would like to split off the work I have done to stage a broader update
for 6.9.0.
----
@siliconforks the code in the proposed PR and all updated versions of it
account for that change. Thanks again!
----
> The strip behavior is very interesting and the proposed change in
> behavior is important.
I’m not sure about this, @jonsurrell, and I’m torn myself. It’s not a proposed
change in behavior, just a bug-fix for a defect that’s been in the code
for [https://github.com/WordPress/WordPress-develop/commit/ec804d2905a4707fd3920b111e43591a94c9cde4
16 years]. I did
look for tickets reporting this but couldn’t find any. I suspect that
the bug has been noticed but is hard to report because it’s not obvious
where it appears. The failing behavior is to return an empty string, and
this is the same behavior when a string is deemed “unsafe” by other means.
What this means is that this specific bug is masked by other intended
rejections where the function is called. Fixing this bug might actually
solve a number of unreported issues. The function docs clearly indicate
that invalid bytes are to be “stripped” but don’t indicate how.
Regardless, there’s a clear statement that //only// the invalid bytes are
to be stripped, not the entire string.
Given that we should avoid //removing// invalid bytes, I propose
substituting.
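
To make the three behaviors under discussion concrete, here is a minimal sketch in Python rather than PHP. The helper names are hypothetical and are not WordPress APIs, and Python's built-in decoder error handlers stand in for whatever the PR actually implements. The sketch only illustrates the difference between rejecting the whole string, removing the invalid bytes, and substituting U+FFFD for them.

```python
# Illustrative sketch only: these helpers are hypothetical and do not
# reproduce the WordPress implementation or the code in the proposed PR.

def reject_if_invalid(data: bytes) -> str:
    """Model the failing behavior: any invalid byte empties the result."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return ""


def strip_invalid(data: bytes) -> str:
    """Remove invalid bytes and keep everything else."""
    return data.decode("utf-8", errors="ignore")


def substitute_invalid(data: bytes) -> str:
    """Replace invalid bytes with U+FFFD and keep everything else."""
    return data.decode("utf-8", errors="replace")


sample = b"caf\xc3\xa9 \xff ok"  # 0xFF can never appear in valid UTF-8

print(repr(reject_if_invalid(sample)))   # whole string is lost
print(repr(strip_invalid(sample)))       # invalid byte removed
print(repr(substitute_invalid(sample)))  # invalid byte becomes U+FFFD

# Removal can join the text on either side of the invalid byte,
# while substitution keeps the boundary visible.
tricky = b"<scr\xffipt>"
print(repr(strip_invalid(tricky)))
print(repr(substitute_invalid(tricky)))
```

One commonly cited reason to prefer substitution, which is my reading rather than something stated in this comment, is shown by the last two lines: removing the byte lets `<scr` and `ipt>` fuse into a new token, whereas the replacement character keeps them apart.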
> That's likely what was intended in the original implementation, but is
> certainly not the documented behavior of this function
There’s also `//TRANSLIT` but these all seem a bit blurry to understand
from the PHP perspective. That is, WordPress doesn’t have control over
them. We can guess that it was supposed to be `//IGNORE`, but I read the
documented behavior to be exactly that. “Strip” implies removing
characters, which is what `//IGNORE` does.
That being said, I feel like “strip” is broad enough to incorporate the
substitution. It’s not a stretch to me to say that putting in the � in
place of the invalid bytes is tantamount to stripping them out. Something
is still there, but they aren’t.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63837#comment:8>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform