[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.

Tue Oct 21 04:10:29 UTC 2025

#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+----------------------
 Reporter:  gitlost       |       Owner:  dmsnell
     Type:  defect (bug)  |      Status:  closed
 Priority:  normal        |   Milestone:  6.9
Component:  Formatting    |     Version:  1.2.1
 Severity:  normal        |  Resolution:  fixed
 Keywords:  has-patch     |     Focuses:
--------------------------+----------------------
Changes (by dmsnell):

 * status:  reopened => closed
 * resolution:   => fixed
 * milestone:  Future Release => 6.9

Comment:

 Re-closing this for now. @SergeyBiryukov I want to remain open about the
 tests that I removed, so I would still love to hear if we have compelling
 reasons to keep them, though I don’t think they are actually carrying
 their weight as they were written.

 > they run the function on several examples of both UTF-8 and non-UTF-8
 > strings

 I could imagine rewriting the tests to align them more with what I believe
 the purpose of the function is, which would involve using
 `mb_convert_encoding()` to create strings in various non-UTF-8 compatible
 encodings and then check if they `seems_utf8()`.
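
 A minimal sketch of what such a rewritten test could look like. The sample
 text and the list of target encodings are illustrative assumptions, and
 `mb_check_encoding()` stands in for `seems_utf8()` so the snippet runs
 without WordPress loaded; a real test would call `seems_utf8()` instead.

```php
<?php
// Sketch only: derive non-UTF-8 byte strings from one UTF-8 sample.
// The sample and the encoding list are assumptions chosen for illustration.
$sample = 'Déjà vu: señor, über, façade';

// Every character in $sample exists in each of these encodings.
$encodings = array( 'ISO-8859-1', 'Windows-1252', 'UTF-16LE' );

$results = array();
foreach ( $encodings as $encoding ) {
	// Re-encode the UTF-8 sample into the target encoding.
	$bytes = mb_convert_encoding( $sample, $encoding, 'UTF-8' );

	// Stand-in for seems_utf8( $bytes ): do these bytes validate as UTF-8?
	$results[ $encoding ] = mb_check_encoding( $bytes, 'UTF-8' );
}

// The original sample itself should validate.
$results['UTF-8'] = mb_check_encoding( $sample, 'UTF-8' );

var_export( $results );
```

 The sample deliberately contains accented characters. Pure ASCII text
 produces the same bytes in ISO-8859-1 and UTF-8, so it would not
 distinguish the encodings at all.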

 > and assert that the expected result is correct

 But again, I know we’re on shaky ground because we don’t actually know
 what the function should do and thus “correct” seems a bit nebulous or
 subjective.

  - Does the function return `true` for valid UTF-8 strings? Yes.
  - Does the function return `false` for invalid UTF-8 strings? Not
 particularly. It returns `false` for a subset of invalid UTF-8 strings but
 returns `true` for a few categories of invalid strings.
  - Does the function return `false` for non-UTF-8 strings? Mostly,
 excluding some categories.
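
 To illustrate how a check can accept some invalid strings, here is a
 sketch that compares a purely structural parse against a strict
 validator. `parses_as_utf8_shape()` is a hypothetical helper written for
 this example, not the code in Core, and the choice of overlong forms,
 surrogate halves, and out-of-range code points as the problem categories
 is an assumption based on the ticket title, not a list taken from this
 comment.

```php
<?php
// Hypothetical helper, not WordPress code: accepts any string whose bytes
// have the shape of UTF-8 lead and continuation bytes, without checking
// whether the decoded code points are permitted.
function parses_as_utf8_shape( string $bytes ): bool {
	$length = strlen( $bytes );
	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );
		if ( $byte < 0x80 ) {
			$following = 0; // 0bbbbbbb: single byte.
		} elseif ( 0xC0 === ( $byte & 0xE0 ) ) {
			$following = 1; // 110bbbbb: one continuation byte.
		} elseif ( 0xE0 === ( $byte & 0xF0 ) ) {
			$following = 2; // 1110bbbb: two continuation bytes.
		} elseif ( 0xF0 === ( $byte & 0xF8 ) ) {
			$following = 3; // 11110bbb: three continuation bytes.
		} else {
			return false; // Not a recognizable lead byte.
		}
		for ( $j = 0; $j < $following; $j++ ) {
			$i++;
			// Each continuation byte must look like 10bbbbbb.
			if ( $i >= $length || 0x80 !== ( ord( $bytes[ $i ] ) & 0xC0 ) ) {
				return false;
			}
		}
	}
	return true;
}

// Assumed example categories; labels are for display only.
$cases = array(
	'valid e-acute'    => "\xC3\xA9",
	'overlong NUL'     => "\xC0\x80",
	'surrogate half'   => "\xED\xA0\x80",
	'above U+10FFFF'   => "\xF4\x90\x80\x80",
	'Latin-1 e-acute'  => "\xE9a",
);

foreach ( $cases as $label => $bytes ) {
	printf(
		"%-16s shape=%s strict=%s\n",
		$label,
		var_export( parses_as_utf8_shape( $bytes ), true ),
		var_export( mb_check_encoding( $bytes, 'UTF-8' ), true )
	);
}
```

 The two checks agree on the valid string and on the Latin-1 bytes. They
 disagree on the three middle cases, which have the right byte shape but
 encode values that RFC 3629 does not allow.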

 After all my playing with it, I think the function was intended to be
 something like `string_parses_as_a_utf8_bitstream()` except it’s defective
 by design. We could fix it and thus define what “correct” means for the
 tests, but that //breaks// behavior for existing use-cases.

 Maybe I’m overthinking it, but I really don’t like the ambiguity of this
 function or our inability to know what it’s supposed to do. In practice
 it’s used to answer an invalid question: “Is this string //already// UTF-8
 or should I //convert it to UTF-8// instead?” Even in the circumstances
 where we could call that question valid, its validity would imply that
 other code in Core would collapse and corrupt output.
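
 One way to see why the question has no reliable answer from the bytes
 alone is the small sketch below. The example bytes are my own choice, and
 ISO-8859-1 is assumed as the alternative encoding purely for
 illustration.

```php
<?php
// The same two bytes are meaningful text under either reading.
$bytes = "\xC3\xA9";

// Read as UTF-8, the bytes are one character: U+00E9 (e with acute).
$is_valid_utf8 = mb_check_encoding( $bytes, 'UTF-8' );

// Read as ISO-8859-1, they are two characters: U+00C3 followed by U+00A9.
// Converting on that assumption yields a different, longer UTF-8 string.
$converted = mb_convert_encoding( $bytes, 'UTF-8', 'ISO-8859-1' );

var_dump( $is_valid_utf8, bin2hex( $bytes ), bin2hex( $converted ) );
```

 A validity check says "already UTF-8" here, yet nothing in the bytes rules
 out that they were written as ISO-8859-1. The check can only report that a
 string parses, not which encoding its author intended.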

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:35>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform