[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.
WordPress Trac
noreply at wordpress.org
Mon Jul 28 18:17:39 UTC 2025
#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+-----------------------------
Reporter: gitlost | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Future Release
Component: Formatting | Version: 1.2.1
Severity: normal | Resolution:
Keywords: has-patch | Focuses:
--------------------------+-----------------------------
Comment (by dmsnell):
While it would be great to get some historic insight, here are some notes
on my best guess about this function. Maybe it was pulled in from another
library which formed with the same misunderstanding about UTF-8, because
UTF-8 //never// supports five or six byte sequences.
[https://www.rfc-editor.org/rfc/rfc2279 RFC2279] was a //draft// proposal
stating the following:
> In UTF-8, characters are encoded using sequences of 1 to 6 octets.
What was adopted, however, is [https://www.rfc-editor.org/rfc/rfc3629
RFC3629], which defines UTF-8 in this different way:
> In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16
accessible range) are encoded using sequences of 1 to 4 octets.
Further, there is a security note in that RFC:
> Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646
description of UTF-8 allows encoding character numbers up to U+7FFFFFFF,
yielding sequences of up to 6 bytes. There is therefore a risk of buffer
overflow if the range of character numbers is not explicitly limited to
U+10FFFF or if buffer sizing doesn't take into account the possibility of
5- and 6-byte sequences.
Because of the confusion of what was //proposed// vs what was //adopted//,
and of allocating space in a decode buffer, I think someone might have
been wanting to be extra careful about inbound UTF-8 streams with more
than four bytes.
However, any byte stream with five or six byte sequences is definitively
//not// UTF-8, and treating it as such is actually a bigger issue than
any buffer allocation would be, particularly since in a function like this
we’re not decoding the bytes or converting them to integers.
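To illustrate the point, here is a minimal sketch in Python (not WordPress
code), assuming that Python's built-in strict UTF-8 decoder is an acceptable
stand-in for an RFC 3629 conforming validator. The helper name
`strictly_valid_utf8` is hypothetical. It shows a five-byte sequence of the
form RFC 2279 described being rejected, along with a four-byte sequence that
would encode a value above U+10FFFF.

```python
def strictly_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True only if Python's strict decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Five-byte form from the older definition: lead byte 111110bb plus four
# continuation bytes. RFC 3629 has no such sequence.
five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# Four bytes, but the encoded value would be U+110000, above the U+10FFFF limit.
above_max = bytes([0xF4, 0x90, 0x80, 0x80])

# U+10FFFF itself, the highest code point RFC 3629 allows.
max_code_point = bytes([0xF4, 0x8F, 0xBF, 0xBF])

for sample in (five_byte, above_max, max_code_point):
    print(sample.hex(" "), strictly_valid_utf8(sample))
```

Only the last sample should be reported as valid.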
----
If this function were //faster// because it only indicates something like
“the bytes provided resemble a UTF-8 sequence but may be invalid”, then
retaining the dangerous behavior could be understandable, but I propose
that this is a historic mistake that Core would only benefit from
remediating.
A simple update would be as small a change as removing the `$n = 5` case.
Granted, removing the five-byte case is not enough to make this function
validate UTF-8. The function is not //in any way// sufficiently designed
to validate UTF-8 data, only to indicate if it roughly resembles UTF-8.
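The gap between "roughly resembles" and "is valid" can be sketched as follows.
This is an illustration in Python under stated assumptions, not the actual
`seems_utf8()` source: `roughly_resembles_utf8` is a hypothetical loose check
that only matches a lead-byte pattern and counts continuation bytes, already
limited to sequences of one to four bytes, and `decodes_as_utf8` uses Python's
strict decoder as a stand-in for a conforming validator.

```python
def roughly_resembles_utf8(data: bytes) -> bool:
    """Hypothetical loose check: lead-byte pattern plus n bytes of the form 10xxxxxx."""
    i = 0
    while i < len(data):
        c = data[i]
        if c < 0x80:
            n = 0  # 0bbbbbbb
        elif (c & 0xE0) == 0xC0:
            n = 1  # 110bbbbb
        elif (c & 0xF0) == 0xE0:
            n = 2  # 1110bbbb
        elif (c & 0xF8) == 0xF0:
            n = 3  # 11110bbb
        else:
            return False  # includes the five- and six-byte lead bytes
        tail = data[i + 1 : i + 1 + n]
        if len(tail) != n or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += 1 + n
    return True


def decodes_as_utf8(data: bytes) -> bool:
    """Stand-in for a conforming validator: Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "overlong NUL (C0 80)": b"\xc0\x80",
    "surrogate U+D800 (ED A0 80)": b"\xed\xa0\x80",
    "above U+10FFFF (F4 90 80 80)": b"\xf4\x90\x80\x80",
    "valid two-byte (C3 A9)": b"\xc3\xa9",
}

for label, data in samples.items():
    print(label, roughly_resembles_utf8(data), decodes_as_utf8(data))
```

The first three samples pass the loose check even without any five- or
six-byte case, yet none of them is valid UTF-8. Overlong encodings, surrogate
halves, and values above U+10FFFF all need checks beyond the byte-pattern
shape.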
This is why I also propose that we deprecate the function entirely, given
the nuance required to understand it, the relatively little value it
provides, and the high risk of using it. We can point developers to clear,
well-specified, conforming functions like `wp_is_valid_utf8()`, which will
work regardless of the runtime system, its extensions, and its build
options.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:6>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform