[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9

WordPress Trac noreply at wordpress.org
Sat Nov 1 18:10:37 UTC 2025


#63863: Standardize UTF-8 handling and fallbacks in 6.9
--------------------------------------+----------------------
 Reporter:  dmsnell                   |       Owner:  dmsnell
     Type:  enhancement               |      Status:  closed
 Priority:  normal                    |   Milestone:  6.9
Component:  Charset                   |     Version:  trunk
 Severity:  normal                    |  Resolution:  fixed
 Keywords:  has-patch has-unit-tests  |     Focuses:
--------------------------------------+----------------------

Comment (by zieladam):

 @dmsnell I've been migrating https://github.com/WordPress/php-toolkit/ to
 those new UTF-8 utilities and I've experienced a huge slowdown to the
 point where GitHub CI kills unit test runners after an hour. They normally
 only take a few minutes. At first I thought I had an infinite loop
 somewhere, but the slowdown actually appears to come from UTF-8 decoding
 itself: `_wp_scan_utf8` is far slower than `utf8_codepoint_at`. Here's a
 small reproduction script I've created:

 {{{#!php
 <?php

 if ( ! isset( $argv[1] ) ) {
         echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
         exit( 1 );
 }
 if ( $argv[1] === 'utf8_codepoint_at' ) {
         $next_codepoint_function = 'next_codepoint_utf8_codepoint_at';
 } elseif ( $argv[1] === '_wp_scan_utf8' ) {
         $next_codepoint_function = 'next_codepoint_wp_scan_utf8';
 } else {
         echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
         exit( 1 );
 }

 // A 10MB XML file from
 // https://raw.githubusercontent.com/WordPress/php-toolkit/ec9187b4a24e1e98c47a185fe9d8114bb09287a3/components/DataLiberation/Tests/wxr/10MB.xml
 $string = file_get_contents( './10MB.xml' );

 echo 'Parsing a 10MB XML file with ' . $next_codepoint_function . '...' . "\n";
 $start_time = microtime(true);
 echo 'Starting at: ' . $start_time . "\n";

 $max_iterations = 1_000_000;
 $iterations = 0;
 $at = 0;
 while($at < strlen($string) && $iterations < $max_iterations) {
         ++$iterations;

         $matched_bytes = 0;
         $next_codepoint = $next_codepoint_function( $string, $at, $matched_bytes );
         if (false === $next_codepoint) {
                 break;
         }

         $at += $matched_bytes;
         if($iterations % 100000 === 0) {
                 $time_taken = microtime(true) - $start_time;
                 echo 'Parsed ' . $iterations . ' codepoints in ' . $time_taken . ' seconds' . "\n";
         }
 }

 /**
  * This seems to be substantially slower than next_codepoint_utf8_codepoint_at().
  */
 function next_codepoint_wp_scan_utf8( $string, $offset, &$matched_bytes = 0 ) {
         $at = $offset;
         $invalid_length = 0;
         $new_at = $at;

         // Byte sequence is not a valid UTF-8 codepoint.
         if ( 1 !== _wp_scan_utf8( $string, $new_at, $invalid_length, null, 1 ) ) {
                 return false;
         }

         $codepoint_byte_length = $new_at - $at;
         $matched_bytes = $codepoint_byte_length;
         return utf8_ord( substr( $string, $at, $codepoint_byte_length ) );
 }

 function next_codepoint_utf8_codepoint_at( $string, $offset, &$matched_bytes = 0 ) {
         $codepoint = utf8_codepoint_at(
                 $string,
                 $offset,
                 $matched_bytes
         );
         if (
                 // Byte sequence is not a valid UTF-8 codepoint.
                 ( 0xFFFD === $codepoint && 0 === $matched_bytes ) ||
                 // No codepoint at the given offset.
                 null === $codepoint
         ) {
                 return false;
         }
         return $codepoint;
 }

 }}}

 The output is:


 {{{
 > php benchmark_utf8_next_codepoint.php utf8_codepoint_at
 Parsing a 10MB XML file with next_codepoint_utf8_codepoint_at...
 Starting at: 1762020120.5581
 Parsed 100000 codepoints in 0.030929088592529 seconds
 Parsed 200000 codepoints in 0.061965942382812 seconds
 Parsed 300000 codepoints in 0.092796087265015 seconds
 Parsed 400000 codepoints in 0.12393403053284 seconds
 Parsed 500000 codepoints in 0.15528607368469 seconds
 Parsed 600000 codepoints in 0.18659496307373 seconds
 Parsed 700000 codepoints in 0.21793103218079 seconds
 Parsed 800000 codepoints in 0.24924111366272 seconds
 Parsed 900000 codepoints in 0.28038597106934 seconds
 Parsed 1000000 codepoints in 0.3110020160675 seconds

 > php benchmark_utf8_next_codepoint.php _wp_scan_utf8
 Parsing a 10MB XML file with next_codepoint_wp_scan_utf8...
 Starting at: 1762020128.9959
 Parsed 100000 codepoints in 0.50441312789917 seconds
 Parsed 200000 codepoints in 2.2213699817657 seconds
 Parsed 300000 codepoints in 3.5741710662842 seconds
 Parsed 400000 codepoints in 3.844514131546 seconds
 Parsed 500000 codepoints in 3.9653370380402 seconds
 Parsed 600000 codepoints in 4.1616570949554 seconds
 Parsed 700000 codepoints in 5.0790500640869 seconds
 Parsed 800000 codepoints in 6.4049270153046 seconds
 Parsed 900000 codepoints in 7.6122870445251 seconds
 Parsed 1000000 codepoints in 7.8051130771637 seconds
 }}}

 So there's a roughly 25x difference in performance (7.8s vs. 0.31s for the
 same million codepoints). That sounds like a big deal given the upcoming
 6.9 release – any ideas where the slowness might come from?
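
 One untested hypothesis: the wrapper above gives `_wp_scan_utf8` a
 codepoint budget of 1, so every codepoint pays the full call overhead plus
 a `substr()` allocation and a `utf8_ord()` re-decode of bytes the scanner
 has already walked. If the scanner is meant to amortize its cost over long
 runs, batching the calls might recover most of the gap. A rough sketch –
 the helper `scan_codepoints_batched` is mine, not part of the patch, and
 it assumes the same signature and by-reference behavior shown in the
 script above:

 {{{#!php
 <?php
 // Hypothetical batched driver: instead of asking _wp_scan_utf8 for one
 // codepoint at a time, let it validate a long run in a single call and
 // only skip bytes when an invalid sequence is reported.
 function scan_codepoints_batched( $string, $batch_size = 4096 ) {
         $at    = 0;
         $total = 0;
         $len   = strlen( $string );
         while ( $at < $len ) {
                 $invalid_length = 0;
                 // Same call as in the benchmark, but with a large
                 // codepoint budget instead of 1; $at advances by reference.
                 $count  = _wp_scan_utf8( $string, $at, $invalid_length, null, $batch_size );
                 $total += $count;
                 if ( $invalid_length > 0 ) {
                         $at += $invalid_length; // Step over the invalid bytes.
                 }
                 if ( 0 === $count && 0 === $invalid_length ) {
                         break; // No progress; avoid an infinite loop.
                 }
         }
         return $total;
 }
 }}}

 If a loop like this runs at near-`utf8_codepoint_at` speed, the cost is in
 the per-call setup rather than in the scanning itself.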

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63863#comment:51>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform
