[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.
WordPress Trac
noreply at wordpress.org
Tue Oct 21 04:10:29 UTC 2025
#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+----------------------
Reporter: gitlost | Owner: dmsnell
Type: defect (bug) | Status: closed
Priority: normal | Milestone: 6.9
Component: Formatting | Version: 1.2.1
Severity: normal | Resolution: fixed
Keywords: has-patch | Focuses:
--------------------------+----------------------
Changes (by dmsnell):
* status: reopened => closed
* resolution: => fixed
* milestone: Future Release => 6.9
Comment:
Re-closing this for now. @SergeyBiryukov I want to remain open about the
tests that I removed, so I would still love to hear if we have compelling
reasons to keep them, though I don’t think they are actually carrying
their weight as they were written.
> they run the function on several examples of both UTF-8 and non-UTF-8
> strings
I could imagine rewriting the tests to align them more with what I believe
the purpose of the function is, which would involve using
`mb_convert_encoding()` to create strings in various non-UTF-8 compatible
encodings and then check if they `seems_utf8()`.
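A minimal sketch of what such a rewritten test could look like, assuming the `mbstring` extension is available. The helper names `build_non_utf8_samples()` and `collect_results()` are hypothetical and not part of Core. Because `seems_utf8()` only exists once WordPress is loaded, the sketch uses `mb_check_encoding()` as a stand-in detector so that it runs on its own; inside the Core test suite the detector would be `'seems_utf8'` instead.

```php
<?php
/**
 * Hypothetical helper: re-encode a UTF-8 sample into other encodings.
 *
 * @param string   $utf8_text Sample text, given as UTF-8 bytes.
 * @param string[] $encodings Target encoding names understood by mbstring.
 * @return array<string,string> Map of encoding name to encoded bytes.
 */
function build_non_utf8_samples( string $utf8_text, array $encodings ): array {
	$samples = array();
	foreach ( $encodings as $encoding ) {
		$samples[ $encoding ] = mb_convert_encoding( $utf8_text, $encoding, 'UTF-8' );
	}
	return $samples;
}

/**
 * Hypothetical helper: run a detector over every sample.
 *
 * @param callable $detector Function taking a string and returning bool.
 * @param array    $samples  Output of build_non_utf8_samples().
 * @return array<string,bool> Map of encoding name to detector result.
 */
function collect_results( callable $detector, array $samples ): array {
	$results = array();
	foreach ( $samples as $encoding => $bytes ) {
		$results[ $encoding ] = (bool) $detector( $bytes );
	}
	return $results;
}

// Stand-in detector (assumption): strict validation via mbstring.
// In a WordPress test this would be 'seems_utf8'.
$strict = static function ( string $bytes ): bool {
	return mb_check_encoding( $bytes, 'UTF-8' );
};

// "café" written with byte escapes so the sample does not depend on
// the encoding of this source file. The sample must contain non-ASCII
// text, since pure ASCII is byte-identical in many encodings.
$samples = build_non_utf8_samples(
	"caf\xC3\xA9",
	array( 'ISO-8859-1', 'Windows-1252', 'UTF-16LE' )
);
$results = collect_results( $strict, $samples );

foreach ( $results as $encoding => $looks_like_utf8 ) {
	printf( "%-14s %s\n", $encoding, var_export( $looks_like_utf8, true ) );
}
```

Taking the detector as a callable means the same samples can be run against both `seems_utf8()` and a strict validator, which would make any disagreement between the two visible in the test output.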
> and assert that the expected result is correct
But again, I know we’re on shaky ground because we don’t actually know
what the function should do and thus “correct” seems a bit nebulous or
subjective.
- Does the function return `true` for valid UTF-8 strings? Yes.
- Does the function return `false` for invalid UTF-8 strings? Not
particularly. It returns `false` for a subset of invalid UTF-8 strings but
returns `true` for a few categories of invalid strings.
- Does the function return `false` for non-UTF-8 strings? Mostly,
excluding some categories.
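To make the second point concrete, here is a sketch that compares a lenient byte-shape check against strict validation. `loose_utf8_shape_check()` is a hypothetical function written for illustration only; it is not the Core implementation, and the inputs below are examples of sequences that RFC 3629 forbids, not a confirmed list of the categories meant above. It assumes the `mbstring` extension for `mb_check_encoding()`.

```php
<?php
/**
 * Hypothetical lenient check: only verifies that each lead byte is
 * followed by the right number of 10xxxxxx continuation bytes.
 * It never looks at the decoded code point.
 */
function loose_utf8_shape_check( string $bytes ): bool {
	$length = strlen( $bytes );
	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );
		if ( $byte < 0x80 ) {
			$following = 0; // 0xxxxxxx: ASCII.
		} elseif ( ( $byte & 0xE0 ) === 0xC0 ) {
			$following = 1; // 110xxxxx.
		} elseif ( ( $byte & 0xF0 ) === 0xE0 ) {
			$following = 2; // 1110xxxx.
		} elseif ( ( $byte & 0xF8 ) === 0xF0 ) {
			$following = 3; // 11110xxx.
		} else {
			return false; // Stray continuation byte or invalid lead byte.
		}
		for ( $j = 0; $j < $following; $j++ ) {
			$i++;
			if ( $i >= $length || ( ord( $bytes[ $i ] ) & 0xC0 ) !== 0x80 ) {
				return false; // Truncated or malformed sequence.
			}
		}
	}
	return true;
}

$cases = array(
	'valid two-byte sequence' => "\xC3\xA9",         // U+00E9.
	'overlong encoding'       => "\xC0\x80",         // U+0000 in two bytes.
	'surrogate half'          => "\xED\xA0\x80",     // U+D800.
	'beyond U+10FFFF'         => "\xF4\x90\x80\x80", // U+110000.
	'Latin-1 bytes'           => "caf\xE9",          // Truncated sequence.
);

foreach ( $cases as $label => $bytes ) {
	printf(
		"%-24s loose=%-5s strict=%s\n",
		$label,
		var_export( loose_utf8_shape_check( $bytes ), true ),
		var_export( mb_check_encoding( $bytes, 'UTF-8' ), true )
	);
}
```

The overlong, surrogate, and out-of-range inputs have the right byte shape, so a check of this kind accepts them even though strict validation rejects them.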
After all my playing with it, I think the function was intended to be
something like `string_parses_as_a_utf8_bitstream()` except it’s defective
by design. We could fix it and thus create “correct” for the tests, but
that //breaks// behavior for existing use-cases.
Maybe I’m overthinking it, but I really don’t like the ambiguity of this
function or our inability to know what it’s supposed to do. In practice
it’s used to answer an invalid question: “Is this string //already// UTF-8
or should I //convert it to UTF-8// instead?” Even in the circumstances
where we could call that question valid, its being valid would imply
that other code in Core would collapse and corrupt output.
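One possible reading of why the question cannot be answered from the bytes alone, which the comment does not spell out, is that a single byte sequence can be well-formed in more than one encoding. The sketch below illustrates this with two bytes that are valid both as UTF-8 and as ISO-8859-1. It assumes the `mbstring` extension, and the variable names are illustrative only.

```php
<?php
// Two bytes that are well-formed in both encodings:
// as UTF-8 they are U+00E9 (é); as ISO-8859-1 they are U+00C3 U+00A9 (Ã©).
$bytes = "\xC3\xA9";

$valid_as_utf8   = mb_check_encoding( $bytes, 'UTF-8' );
$valid_as_latin1 = mb_check_encoding( $bytes, 'ISO-8859-1' );

// If a caller decides the bytes are "not yet UTF-8" and converts them,
// the result is a different, longer byte sequence (double encoding).
$reencoded = mb_convert_encoding( $bytes, 'UTF-8', 'ISO-8859-1' );

printf( "valid as UTF-8:      %s\n", var_export( $valid_as_utf8, true ) );
printf( "valid as ISO-8859-1: %s\n", var_export( $valid_as_latin1, true ) );
printf( "after conversion:    %s\n", bin2hex( $reencoded ) );
```

Under this reading, a detector can only say whether bytes could be UTF-8, not which encoding the author intended, so a convert-or-keep decision based on it is a guess.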
--
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:35>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform