[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.

Mon Jul 28 18:17:39 UTC 2025

#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+-----------------------------
 Reporter:  gitlost       |       Owner:  (none)
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  Future Release
Component:  Formatting    |     Version:  1.2.1
 Severity:  normal        |  Resolution:
 Keywords:  has-patch     |     Focuses:
--------------------------+-----------------------------

Comment (by dmsnell):

 While it would be great to get some historical insight, here are some notes
 on my best guess about this function. Maybe it was pulled in from another
 library that was written under the same misunderstanding about UTF-8,
 because UTF-8 //never// supports five- or six-byte sequences.

 [https://www.rfc-editor.org/rfc/rfc2279 RFC2279] was a //draft// proposal
 stating the following:

 > In UTF-8, characters are encoded using sequences of 1 to 6 octets.

 What was adopted, however, is [https://www.rfc-editor.org/rfc/rfc3629
 RFC3629], which defines UTF-8 in this different way:

 > In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
 accessible range) are encoded using sequences of 1 to 4 octets.

 Further, there is a security note in that RFC:

 > Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646
 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF,
 yielding sequences of up to 6 bytes.  There is therefore a risk of buffer
 overflow if the range of character numbers is not explicitly limited to
 U+10FFFF or if buffer sizing doesn't take into account the possibility of
 5- and 6-byte sequences.

 Because of the confusion between what was //proposed// and what was
 //adopted//, and the concern about allocating space in a decode buffer, I
 think someone might have wanted to be extra careful about inbound UTF-8
 streams containing sequences longer than four bytes.

 However, any byte stream with five- or six-byte sequences is definitively
 //not// UTF-8, and treating it as such is actually a bigger issue than
 any buffer allocation would be, particularly since in a function like this
 we’re not decoding the bytes or converting them to integers.
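
 To make the RFC 3629 limit concrete, here is a minimal sketch in Python
 (chosen only for illustration; it is not WordPress code) that maps a lead
 byte to the sequence length RFC 3629 permits. The helper name
 `sequence_length` is hypothetical. Lead bytes `0xF8` through `0xFD`, which
 would introduce five- and six-byte sequences under the older definition,
 have no valid length at all.

```python
from typing import Optional


def sequence_length(lead: int) -> Optional[int]:
    """Return the UTF-8 sequence length implied by a lead byte under
    RFC 3629, or None if the byte can never start a sequence.

    Hypothetical helper for illustration only.
    """
    if 0x00 <= lead <= 0x7F:
        return 1  # ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2  # 0xC0 and 0xC1 are excluded: they only form overlong encodings
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4  # 0xF5..0xF7 would encode values above U+10FFFF
    return None  # continuation bytes, 0xC0/0xC1, and 0xF5..0xFF


# Lead bytes of the old five- and six-byte forms are simply invalid.
five_byte_lead = sequence_length(0xF8)
six_byte_lead = sequence_length(0xFC)
```

 Python's own strict decoder agrees: `bytes([0xF8, 0x88, 0x80, 0x80,
 0x80]).decode("utf-8")` raises `UnicodeDecodeError` rather than producing a
 character.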

 ----

 If this function were //faster// because it only indicated something like
 “the bytes provided resemble a UTF-8 sequence but may be invalid,” then
 retaining the dangerous behavior might be understandable. I propose,
 however, that this is a historical mistake and that Core would only
 benefit from remediating it.

 A minimal update would be as simple as removing the `$n = 5` case.

 Granted, removing the five-byte case is not enough to make this function
 validate UTF-8. The function is not //in any way// sufficiently designed
 to validate UTF-8 data, only to indicate if it roughly resembles UTF-8.
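
 The gap between "roughly resembles UTF-8" and "is valid UTF-8" can be
 sketched as follows. This is Python for illustration only, and
 `loosely_resembles_utf8` is a hypothetical structural check written for
 this example (lead byte plus the expected number of continuation bytes,
 capped at four-byte sequences); it is not the actual `seems_utf8()` source.
 The strict check relies on Python's built-in decoder, which rejects
 overlong forms, surrogates, and code points above U+10FFFF.

```python
def loosely_resembles_utf8(data: bytes) -> bool:
    """Hypothetical structural check: each lead byte must be followed by
    the number of continuation bytes its bit pattern announces."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            trailing = 0
        elif (b & 0xE0) == 0xC0:
            trailing = 1
        elif (b & 0xF0) == 0xE0:
            trailing = 2
        elif (b & 0xF8) == 0xF0:
            trailing = 3
        else:
            return False  # five- and six-byte lead bytes are rejected here
        for _ in range(trailing):
            i += 1
            if i >= n or (data[i] & 0xC0) != 0x80:
                return False
        i += 1
    return True


def is_strictly_valid_utf8(data: bytes) -> bool:
    """Strict validation using Python's built-in UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Inputs that pass the structural check but are not valid UTF-8.
SAMPLES = {
    "overlong encoding of '/'": b"\xc0\xaf",
    "UTF-16 surrogate U+D800": b"\xed\xa0\x80",
    "code point above U+10FFFF": b"\xf4\x90\x80\x80",
}

results = {
    label: (loosely_resembles_utf8(raw), is_strictly_valid_utf8(raw))
    for label, raw in SAMPLES.items()
}
```

 Each sample yields `(True, False)`: the byte pattern looks right, yet the
 input is invalid, so a caller that treats the loose answer as validation
 would accept all three.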

 This is why I also propose that we deprecate the function entirely, given
 the nuance required to understand it, the relative lack of value it
 provides, and the high risk of using it. We can point developers to clear,
 well-specified, conforming functions like `wp_is_valid_utf8()`, which
 will work regardless of the runtime system, its extensions, and its build
 options.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform