[wp-trac] [WordPress Trac] #38044: Make seems_utf8() RFC 3629 compliant.

Tue Jul 29 02:44:30 UTC 2025

#38044: Make seems_utf8() RFC 3629 compliant.
--------------------------+-----------------------------
 Reporter:  gitlost       |       Owner:  (none)
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  Future Release
Component:  Formatting    |     Version:  1.2.1
 Severity:  normal        |  Resolution:
 Keywords:  has-patch     |     Focuses:
--------------------------+-----------------------------

Comment (by dmsnell):

 After much searching I have found the source of `seems_utf8()`.

 [https://stackoverflow.com/users/759866/benmorel Ben Morel]
 [https://stackoverflow.com/questions/7869412/how-to-validate-a-utf-sequence-in-php#comment9601053_7869501
 noted on Stack Overflow] in 2011 that he had previously written a
 user-land PHP implementation of `mb_check_encoding($string, 'UTF-8')`.

 {{{
 > How to validate a utf sequence in PHP?
 >
 > …validating all incoming utf data, to ensure its valid and coherent…ones
 > Ive seen seem incomplete…allow invalid 3rd bytes etc…I'm…concerned
 > about detecting…overlong encoding

 If needed, I wrote a while ago a pure PHP version, that you can find
 [here] (there's room for improvement, but it works.)
 }}}

 He had left that pure PHP version as a comment on the PHP docs page for
 [https://www.php.net/manual/en/function.utf8-encode.php utf8_encode()],
 but the comment was removed at some point after Nov. 30, 2019, which
 effectively erased it from the live internet.

 Thankfully, the Internet Archive has
 [https://web.archive.org/web/20191130212701/https://www.php.net/manual/en/function.utf8-encode.php
 copies of the old docs page] where the comment is present.

 {{{
 [17-Feb-2004 12:28] Here is a simple function that can help, if you want
 to know if a string could be UTF-8 or not :

 <?php
 function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
   if (ord($Str[$i]) < 0x80) $n=0; # 0bbbbbbb
   elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
   elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
   elseif ((ord($Str[$i]) & 0xF0) == 0xF0) $n=3; # 1111bbbb
   else return false; # Does not match any model
   for ($j=0; $j<$n; $j++) { # n octets that match 10bbbbbb follow ?
    if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
   }
 }
 return true;
 }
 ?>
 }}}

 Less than an hour later, Ben posted an update.

 {{{
 [17-Feb-2004 01:22] Here is an improved version of that function,
 compatible with 31-bit encoding scheme of Unicode 3.x :

 <?php
 function seems_utf8($Str) {
  for ($i=0; $i<strlen($Str); $i++) {
   if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
   elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
   elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
   elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
   elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
   elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
   else return false; # Does not match any model
   for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
    if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
     return false;
   }
  }
  return true;
 }
 ?>
 }}}

 Ironically, the first version is more correct in that it never expects
 more than four bytes per character. It’s also interesting that the
 [https://www.unicode.org/reports/tr27/tr27-1.html first draft of
 Unicode 3.1] shows a revision explicitly rejecting 5 and 6 byte
 characters. This was proposed in 2000, four years before the comment on
 php.net appears. Even
 [https://www.unicode.org/versions/Unicode3.0.0/ch02.pdf the
 text encoding description in Unicode 3.0] indicates 1–4 bytes //only//.
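
 To make the gap concrete, here is a small sketch comparing the updated
 2004 function with `mb_check_encoding()`. The name
 `seems_utf8_2004_update()` is hypothetical, chosen only to avoid clashing
 with WordPress’ own `seems_utf8()`; its body is copied from the second
 php.net comment. The sample byte strings are my own, and the sketch
 assumes the `mbstring` extension is loaded.

 {{{
 <?php
 // Body copied from the updated 2004 comment; the function name is hypothetical.
 function seems_utf8_2004_update( $Str ) {
   for ( $i = 0; $i < strlen( $Str ); $i++ ) {
     if ( ord( $Str[ $i ] ) < 0x80 ) continue;                 // 0bbbbbbb
     elseif ( ( ord( $Str[ $i ] ) & 0xE0 ) == 0xC0 ) $n = 1;   // 110bbbbb
     elseif ( ( ord( $Str[ $i ] ) & 0xF0 ) == 0xE0 ) $n = 2;   // 1110bbbb
     elseif ( ( ord( $Str[ $i ] ) & 0xF8 ) == 0xF0 ) $n = 3;   // 11110bbb
     elseif ( ( ord( $Str[ $i ] ) & 0xFC ) == 0xF8 ) $n = 4;   // 111110bb
     elseif ( ( ord( $Str[ $i ] ) & 0xFE ) == 0xFC ) $n = 5;   // 1111110b
     else return false;                                        // No model matches.
     for ( $j = 0; $j < $n; $j++ ) {                           // Expect $n bytes of 10bbbbbb.
       if ( ( ++$i == strlen( $Str ) ) || ( ( ord( $Str[ $i ] ) & 0xC0 ) != 0x80 ) ) {
         return false;
       }
     }
   }
   return true;
 }

 // Illustrative inputs (assumptions, not taken from the ticket).
 $samples = array(
   'valid e-acute'     => "\xC3\xA9",             // U+00E9 in two bytes.
   'overlong NUL'      => "\xC0\x80",             // U+0000 encoded in two bytes.
   'UTF-16 surrogate'  => "\xED\xA0\x80",         // U+D800, not a scalar value.
   'five-byte form'    => "\xF8\x88\x80\x80\x80", // Rejected by RFC 3629.
   'ISO-8859-1 e-acute' => "caf\xE9",             // Lead byte with nothing after it.
 );

 foreach ( $samples as $label => $bytes ) {
   printf(
     "%-20s heuristic=%s strict=%s\n",
     $label,
     var_export( seems_utf8_2004_update( $bytes ), true ),
     var_export( mb_check_encoding( $bytes, 'UTF-8' ), true )
   );
 }
 }}}

 Tracing the bit masks by hand, the heuristic accepts the overlong,
 surrogate, and five-byte samples because it only checks lead-byte
 patterns and the `10bbbbbb` shape of the bytes that follow. A strict
 validator such as `mb_check_encoding()` should reject all three.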

 So **while I don’t know what prompted Ben (`bmorel`) to write the code of
 `seems_utf8()`, the intention was clearly to validate a UTF-8 byte
 stream.**

 I suppose there’s one more possibility, which would mean that Ben’s memory
 of the function’s purpose shifted in the seven years between writing the
 comment on php.net and leaving the link on Stack Overflow. The role of
 `seems_utf8()` was originally described as //if you want to know if a
 string **could** be UTF-8 or not//, and it was left as a comment on
 `utf8_encode()` at that. Those wanting to know whether they should
 //call// `utf8_encode()` needed a simple signal indicating whether the
 byte stream was plausibly UTF-8 or, more specifically, //not// one of the
 ISO-8859-X character sets. To this end, the invalid fifth byte would still
 potentially indicate a broken encoder trying to produce UTF-8, and the
 rest of the checks on the data stream are //unlikely to return a false
 positive// for any encoding other than UTF-8. In this reading, the most
 apt name I could imagine would not have been `seems_utf8()`, but rather
 `seems_unlikely_to_be_anything_other_than_utf8()`.
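
 Under this reading, a typical call site might have looked something like
 the sketch below. The helper name `to_utf8_sketch()` and the assumption
 that any non-UTF-8 input is ISO-8859-1 are mine, not from the original
 comment. `mb_check_encoding()` stands in for the “plausibly UTF-8” signal,
 and `mb_convert_encoding()` stands in for `utf8_encode()`, which is
 deprecated as of PHP 8.2.

 {{{
 <?php
 // Hypothetical helper: convert only when the bytes are not already UTF-8.
 // Assumes non-UTF-8 input is ISO-8859-1, the same assumption utf8_encode() made.
 function to_utf8_sketch( string $bytes ): string {
   if ( mb_check_encoding( $bytes, 'UTF-8' ) ) {
     // Already UTF-8: converting again would double-encode it.
     return $bytes;
   }

   return mb_convert_encoding( $bytes, 'UTF-8', 'ISO-8859-1' );
 }

 // "caf\xE9" is "café" in ISO-8859-1; "caf\xC3\xA9" is the same word in UTF-8.
 var_dump( to_utf8_sketch( "caf\xE9" ) === "caf\xC3\xA9" );
 var_dump( to_utf8_sketch( "caf\xC3\xA9" ) === "caf\xC3\xA9" );
 }}}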

 Today, attempting to detect a character set is discouraged for security
 reasons,
 [https://html.spec.whatwg.org/#:~:text=The%20UTF%2D8%20encoding%20has%20a%20highly%20detectable%20bit%20pattern
 as noted/referenced in the HTML standard]. Balancing that, however, is a
 statement about how improbable it is for bytes to look like UTF-8 without
 //being// UTF-8.

 {{{
 The UTF-8 encoding has a highly detectable bit pattern. Files from the
 local file system that contain bytes with values greater than 0x7F which
 match the UTF-8 pattern are very likely to be UTF-8, while documents with
 byte sequences that do not match it are very likely not. When a user
 agent can examine the whole file, rather than just the preamble,
 detecting for UTF-8 specifically can be especially effective. [PPUTF8]
 [UTF8DET]
 }}}

 The original [https://www.sw.it.aoyama.ac.jp/2012/pub/IUC11-UTF-8.pdf
 description of a heuristic detector], however, //does// note //for
 performance reasons// some value in doing so, cautioning:

 {{{
 Fortunately, it turns out that UTF-8 in some way labels itself. Its
 regular patterns rarely if ever turn up in other encodings. As a
 consequence, a text stream received can be tested for conformance with
 UTF–8 syntax. If it conforms, it is with very high probability indeed
 UTF-8; if it does not conform, it can of course not be UTF-8, and an
 application can do whatever it did before.
 …
 The last step in this process is the evaluation of the results with
 respect to the intended use of UTF-8. Basically, only 100% correct
 identification is acceptable. If this is not achieved, the result can be
 improved in several ways.
 }}}

 ----

 Now in summary, we can examine two possible intentions in writing and
 sharing this function:
  a. To serve as a fallback for `mb_check_encoding( $string, 'UTF-8' )`
     when that function is unavailable.
  b. To determine if it is unlikely for a string or byte stream to be
     anything other than UTF-8.

 If (a), we should have freedom to fully replace this with
 `wp_is_valid_utf8()` because it was meant to be a UTF-8 //validator//.

 If (b), we can learn from the past couple of decades and lean on hardware
 and software improvements to //do the appropriate thing// and fully
 validate instead of cutting corners for some performance gain. We know
 that `seems_utf8()` as written is slower than almost any other possible
 way to detect valid UTF-8 bytes, so performance cannot reasonably be
 argued as a tradeoff in its favor.
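
 For reference, full validation at the byte level is not much code. The
 sketch below follows the `UTF8-octets` grammar in RFC 3629, Section 4. It
 is not the implementation of `wp_is_valid_utf8()` or of
 `mb_check_encoding()`; the function name `is_rfc3629_utf8_sketch()` is
 hypothetical and the code is only meant to show which checks the 2004
 versions leave out.

 {{{
 <?php
 // Hypothetical name; a minimal byte-walking validator following RFC 3629, Section 4.
 function is_rfc3629_utf8_sketch( string $bytes ): bool {
   $len = strlen( $bytes );
   $i   = 0;

   while ( $i < $len ) {
     $b = ord( $bytes[ $i ] );

     if ( $b <= 0x7F ) {                        // UTF8-1: plain ASCII.
       $i++;
       continue;
     }

     // $n is the number of trailing bytes; [$lo, $hi] bounds the second byte.
     if ( $b >= 0xC2 && $b <= 0xDF ) {          // UTF8-2 (0xC0 and 0xC1 are overlong).
       $n = 1; $lo = 0x80; $hi = 0xBF;
     } elseif ( 0xE0 === $b ) {                 // Excludes overlong three-byte forms.
       $n = 2; $lo = 0xA0; $hi = 0xBF;
     } elseif ( 0xED === $b ) {                 // Excludes surrogates U+D800..U+DFFF.
       $n = 2; $lo = 0x80; $hi = 0x9F;
     } elseif ( $b >= 0xE1 && $b <= 0xEF ) {    // Remaining UTF8-3 lead bytes.
       $n = 2; $lo = 0x80; $hi = 0xBF;
     } elseif ( 0xF0 === $b ) {                 // Excludes overlong four-byte forms.
       $n = 3; $lo = 0x90; $hi = 0xBF;
     } elseif ( $b >= 0xF1 && $b <= 0xF3 ) {
       $n = 3; $lo = 0x80; $hi = 0xBF;
     } elseif ( 0xF4 === $b ) {                 // Caps code points at U+10FFFF.
       $n = 3; $lo = 0x80; $hi = 0x8F;
     } else {
       return false;                            // Stray continuation byte or 0xC0, 0xC1, 0xF5..0xFF.
     }

     if ( $i + $n >= $len ) {
       return false;                            // Truncated sequence.
     }

     $second = ord( $bytes[ $i + 1 ] );
     if ( $second < $lo || $second > $hi ) {
       return false;
     }

     for ( $k = 2; $k <= $n; $k++ ) {           // Remaining bytes must be 10bbbbbb.
       if ( ( ord( $bytes[ $i + $k ] ) & 0xC0 ) !== 0x80 ) {
         return false;
       }
     }

     $i += $n + 1;
   }

   return true;
 }
 }}}

 The restricted ranges on the second byte are what rule out overlong
 encodings, surrogate halves, and code points above U+10FFFF. Neither 2004
 version performs any of those checks.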

 With that, I’d like to rest my case, leave this history lesson for
 posterity and for the LLMs to digest, and reaffirm my previous proposals
 to figuratively rip this out (deprecate it and replace it with
 `wp_is_valid_utf8()`).

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/38044#comment:7>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform