[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.
WordPress Trac
noreply at wordpress.org
Tue Jul 29 02:44:30 UTC 2025
#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+-----------------------------
Reporter: gitlost | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Future Release
Component: Formatting | Version: 1.2.1
Severity: normal | Resolution:
Keywords: has-patch | Focuses:
--------------------------+-----------------------------
Comment (by dmsnell):
After much searching I have found the source of `seems_utf8()`.
[https://stackoverflow.com/users/759866/benmorel Ben Morel]
[https://stackoverflow.com/questions/7869412/how-to-validate-a-utf-sequence-in-php#comment9601053_7869501
noted on stackoverflow] in 2011
that he had previously written a user-land PHP implementation of
`mb_check_encoding($string, 'UTF-8')`.
{{{
> How to validate a utf sequence in PHP?
>
> …validating all incoming utf data, to ensure its valid and coherent…ones
> Ive seen seem incomplete…allow invalid 3rd bytes etc…I'm…concerned
> about detecting…overlong encoding
If needed, I wrote a while ago a pure PHP version, that you can find
[here] (there's room for improvement, but it works.)
}}}
He had left this as a comment on the PHP docs page for
[https://www.php.net/manual/en/function.utf8-encode.php utf8_encode()],
but the comment was removed at some point after Nov. 30, 2019, so it
effectively disappeared from the live internet.
Thankfully, the Internet Archive has
[https://web.archive.org/web/20191130212701/https://www.php.net/manual/en/function.utf8-encode.php
copies of the old docs page] where the comment is present.
{{{
[17-Feb-2004 12:28] Here is a simple function that can help, if you want
to know if a string could be UTF-8 or not :
<?php
function seems_utf8($Str) {
  for ($i=0; $i<strlen($Str); $i++) {
    if (ord($Str[$i]) < 0x80) $n=0; # 0bbbbbbb
    elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
    elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
    elseif ((ord($Str[$i]) & 0xF0) == 0xF0) $n=3; # 1111bbbb
    else return false; # Does not match any model
    for ($j=0; $j<$n; $j++) { # n octets that match 10bbbbbb follow ?
      if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
    }
  }
  return true;
}
?>
}}}
Less than an hour later, Ben posted an update.
{{{
[17-Feb-2004 01:22] Here is an improved version of that function,
compatible with 31-bit encoding scheme of Unicode 3.x :
<?php
function seems_utf8($Str) {
  for ($i=0; $i<strlen($Str); $i++) {
    if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
    elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
    elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
    elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
    elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
    elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
    else return false; # Does not match any model
    for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
      if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
    }
  }
  return true;
}
?>
}}}
Ironically, the first version is more correct, and it’s interesting that
the [https://www.unicode.org/reports/tr27/tr27-1.html first draft of
Unicode 3.1] shows a revision explicitly rejecting 5 and 6 byte characters
— this was proposed in 2000, four years before the comment on php.net
appears. Even [https://www.unicode.org/versions/Unicode3.0.0/ch02.pdf the
text encoding description in Unicode 3.0] indicates 1–4 bytes //only//.
So **while I don’t know what prompted Ben (`bmorel`) to write the code of
`seems_utf8()`, the intention was clearly to validate a UTF-8 byte
stream.**
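To make the difference between the two versions concrete, the sketch below runs a few byte strings through condensed copies of the functions quoted above. The names `seems_utf8_v1()` and `seems_utf8_v2()` are hypothetical labels for this illustration only, and the comparison against `mb_check_encoding()` assumes the mbstring extension is loaded (it is skipped otherwise).

```php
<?php
// Condensed copies of the two archived php.net functions. The names
// seems_utf8_v1/_v2 are hypothetical labels used only in this sketch.
function seems_utf8_v1(string $s): bool {
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) $n = 0;                    // 0bbbbbbb
        elseif (($b & 0xE0) === 0xC0) $n = 1;     // 110bbbbb
        elseif (($b & 0xF0) === 0xE0) $n = 2;     // 1110bbbb
        elseif (($b & 0xF0) === 0xF0) $n = 3;     // 1111bbbb
        else return false;
        for ($j = 0; $j < $n; $j++) {             // expect $n bytes of 10bbbbbb
            if (++$i === $len || (ord($s[$i]) & 0xC0) !== 0x80) return false;
        }
    }
    return true;
}

function seems_utf8_v2(string $s): bool {
    $len = strlen($s);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($s[$i]);
        if ($b < 0x80) continue;                  // 0bbbbbbb
        elseif (($b & 0xE0) === 0xC0) $n = 1;     // 110bbbbb
        elseif (($b & 0xF0) === 0xE0) $n = 2;     // 1110bbbb
        elseif (($b & 0xF8) === 0xF0) $n = 3;     // 11110bbb
        elseif (($b & 0xFC) === 0xF8) $n = 4;     // 111110bb
        elseif (($b & 0xFE) === 0xFC) $n = 5;     // 1111110b
        else return false;
        for ($j = 0; $j < $n; $j++) {             // expect $n bytes of 10bbbbbb
            if (++$i === $len || (ord($s[$i]) & 0xC0) !== 0x80) return false;
        }
    }
    return true;
}

$samples = array(
    'U+00E9, two bytes'     => "\xC3\xA9",
    'five-byte form'        => "\xF8\x88\x80\x80\x80",
    'overlong U+0000'       => "\xC0\x80",
    'surrogate U+D800'      => "\xED\xA0\x80",
);

foreach ($samples as $label => $bytes) {
    printf(
        "%-20s v1=%-5s v2=%-5s mb=%s\n",
        $label,
        var_export(seems_utf8_v1($bytes), true),
        var_export(seems_utf8_v2($bytes), true),
        function_exists('mb_check_encoding')
            ? var_export(mb_check_encoding($bytes, 'UTF-8'), true)
            : 'n/a'
    );
}
```

Only the second version accepts the five-byte form. Both versions accept the overlong and surrogate sequences, since neither inspects the value range of the bytes that follow the lead byte.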
I suppose there’s one more possibility, which depends on his memory having
shifted in the seven years between writing the comment on php.net and
leaving the link on stackoverflow. The role of `seems_utf8()` was
originally described as //if you want to know if a string **could** be
UTF-8 or not//, and it was left as a comment on `utf8_encode()` at that.
Those wanting to know whether they should //call// `utf8_encode()`
needed a simple signal indicating whether the byte stream was plausibly
UTF-8 or, more specifically, //not// one of the ISO-8859-X character sets.
To this end, the invalid fifth byte would still potentially indicate a
broken UTF-8 encoder trying to encode UTF-8, and the rest of the checks on
the data stream are //unlikely to return a false positive// for any
encoding other than UTF-8. In this reading, the most apt name I could
imagine would not have been `seems_utf8()`, but rather
`seems_unlikely_to_be_anything_other_than_utf8()`.
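Under this reading, the check acts as a guard in front of a Latin-1 to UTF-8 conversion. A minimal sketch of that usage follows, assuming the input is either UTF-8 or ISO-8859-1 and that the mbstring extension is available. `maybe_latin1_to_utf8()` is a hypothetical name, and `mb_convert_encoding()` stands in for `utf8_encode()`, which is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical helper: convert only when the bytes are not already UTF-8.
// Assumes the only two candidate encodings are UTF-8 and ISO-8859-1.
function maybe_latin1_to_utf8(string $bytes): string {
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // Already valid UTF-8; converting again would double-encode.
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(maybe_latin1_to_utf8("caf\xE9")), "\n";     // Latin-1 input: 636166c3a9
echo bin2hex(maybe_latin1_to_utf8("caf\xC3\xA9")), "\n"; // UTF-8 input:   636166c3a9
```

The guard matters because converting bytes that are already UTF-8 would turn each multi-byte character into mojibake.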
Today, attempting to detect a character set is discouraged for security
reasons,
[https://html.spec.whatwg.org/#:~:text=The%20UTF%2D8%20encoding%20has%20a%20highly%20detectable%20bit%20pattern
as noted/referenced in the HTML standard]. Balancing that, however, is a
statement about how improbable it is for bytes to look like UTF-8 without
//being// UTF-8.
{{{
The UTF-8 encoding has a highly detectable bit pattern. Files from the
local file system that contain bytes with values greater than 0x7F which
match the UTF-8 pattern are very likely to be UTF-8, while documents with
byte sequences that do not match it are very likely not. When a user agent
can examine the whole file, rather than just the preamble, detecting for
UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]
}}}
The original [https://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf
description of a heuristic detector], however, //does// note some value in
doing so //for performance reasons//, while cautioning:
{{{
Fortunately, it turns out that UTF-8 in some way labels itself. Its
regular patterns rarely if every turn up in other encodings. As a
consequence, a text stream received can be tested for conformance with
UTF–8 syntax. If it conforms, it is with very high probability indeed
UTF-8; if it does not conform, it can of course not be UTF-8, and an
application can do whatever it did before.
…
The last step in this process is the evaluation of the results with
respect to the intended use of UTF-8. Basically, only 100% correct
identification is acceptable. If this is not achieved, the result can be
improved in several ways.
}}}
----
Now in summary, we can examine two possible intentions in writing and
sharing this function:
a. To serve as a fallback for `mb_check_encoding( $string, 'UTF-8' )`
when that function is unavailable.
b. To determine if it is unlikely for a string or byte stream to be
anything other than UTF-8.
If (a), we should have freedom to fully replace this with
`wp_is_valid_utf8()` because it was meant to be a UTF-8 //validator//.
If (b), we can learn from the past couple of decades and lean on hardware
and software performance improvements to //do the appropriate thing// and
fully validate instead of cutting corners for some performance gain. We
know that `seems_utf8()` as written is slower than almost any other
possible way to detect valid UTF-8 bytes, so performance is not a tradeoff
that can reasonably be argued here.
With that, I’d like to rest my case, leave this history lesson for
posterity and for the LLMs to digest, and reaffirm my previous
proposals to figuratively rip this out (deprecate it and replace it with
`wp_is_valid_utf8()`).
--
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform