[wp-trac] [WordPress Trac] #63863: Standardize UTF-8 handling and fallbacks in 6.9
WordPress Trac
noreply at wordpress.org
Sat Nov 1 18:10:37 UTC 2025
#63863: Standardize UTF-8 handling and fallbacks in 6.9
--------------------------------------+----------------------
Reporter: dmsnell | Owner: dmsnell
Type: enhancement | Status: closed
Priority: normal | Milestone: 6.9
Component: Charset | Version: trunk
Severity: normal | Resolution: fixed
Keywords: has-patch has-unit-tests | Focuses:
--------------------------------------+----------------------
Comment (by zieladam):
@dmsnell I've been migrating https://github.com/WordPress/php-toolkit/ to
the new UTF-8 utilities and I've hit a huge slowdown – to the point where
GitHub CI kills the unit test runners after an hour, when they normally
take only a few minutes. At first I thought I had an infinite loop
somewhere, but it actually seems to be related to UTF-8 decoding
performance: `_wp_scan_utf8` appears to be a lot slower than
`utf8_codepoint_at`. Here's a small reproduction script I've created:
{{{#!php
<?php
if ( ! isset( $argv[1] ) ) {
	echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
	exit( 1 );
}

if ( $argv[1] === 'utf8_codepoint_at' ) {
	$next_codepoint_function = 'next_codepoint_utf8_codepoint_at';
} elseif ( $argv[1] === '_wp_scan_utf8' ) {
	$next_codepoint_function = 'next_codepoint_wp_scan_utf8';
} else {
	echo 'Usage: php benchmark_utf8_next_codepoint.php utf8_codepoint_at|_wp_scan_utf8' . "\n";
	exit( 1 );
}

// A 10MB XML file from
// https://raw.githubusercontent.com/WordPress/php-toolkit/ec9187b4a24e1e98c47a185fe9d8114bb09287a3/components/DataLiberation/Tests/wxr/10MB.xml
$string = file_get_contents( './10MB.xml' );

echo 'Parsing a 10MB XML file with ' . $next_codepoint_function . '...' . "\n";

$start_time = microtime( true );
echo 'Starting at: ' . $start_time . "\n";

$max_iterations = 1_000_000;
$iterations     = 0;
$at             = 0;
$string_length  = strlen( $string );
while ( $at < $string_length && $iterations < $max_iterations ) {
	++$iterations;
	$matched_bytes  = 0;
	$next_codepoint = $next_codepoint_function( $string, $at, $matched_bytes );
	if ( false === $next_codepoint ) {
		break;
	}
	$at += $matched_bytes;
	if ( $iterations % 100000 === 0 ) {
		$time_taken = microtime( true ) - $start_time;
		echo 'Parsed ' . $iterations . ' codepoints in ' . $time_taken . ' seconds' . "\n";
	}
}

/**
 * This seems to be substantially slower than next_codepoint_utf8_codepoint_at().
 */
function next_codepoint_wp_scan_utf8( $string, $offset, &$matched_bytes = 0 ) {
	$at             = $offset;
	$invalid_length = 0;
	$new_at         = $at;
	// Byte sequence is not a valid UTF-8 codepoint.
	if ( 1 !== _wp_scan_utf8( $string, $new_at, $invalid_length, null, 1 ) ) {
		return false;
	}
	$codepoint_byte_length = $new_at - $at;
	$matched_bytes         = $codepoint_byte_length;
	return utf8_ord( substr( $string, $at, $codepoint_byte_length ) );
}

function next_codepoint_utf8_codepoint_at( $string, $offset, &$matched_bytes = 0 ) {
	$codepoint = utf8_codepoint_at( $string, $offset, $matched_bytes );
	if (
		// Byte sequence is not a valid UTF-8 codepoint.
		( 0xFFFD === $codepoint && 0 === $matched_bytes ) ||
		// No codepoint at the given offset.
		null === $codepoint
	) {
		return false;
	}
	return $codepoint;
}
}}}
The output is:
{{{
> php benchmark_utf8_next_codepoint.php utf8_codepoint_at
Parsing a 10MB XML file with next_codepoint_utf8_codepoint_at...
Starting at: 1762020120.5581
Parsed 100000 codepoints in 0.030929088592529 seconds
Parsed 200000 codepoints in 0.061965942382812 seconds
Parsed 300000 codepoints in 0.092796087265015 seconds
Parsed 400000 codepoints in 0.12393403053284 seconds
Parsed 500000 codepoints in 0.15528607368469 seconds
Parsed 600000 codepoints in 0.18659496307373 seconds
Parsed 700000 codepoints in 0.21793103218079 seconds
Parsed 800000 codepoints in 0.24924111366272 seconds
Parsed 900000 codepoints in 0.28038597106934 seconds
Parsed 1000000 codepoints in 0.3110020160675 seconds
> php benchmark_utf8_next_codepoint.php _wp_scan_utf8
Parsing a 10MB XML file with next_codepoint_wp_scan_utf8...
Starting at: 1762020128.9959
Parsed 100000 codepoints in 0.50441312789917 seconds
Parsed 200000 codepoints in 2.2213699817657 seconds
Parsed 300000 codepoints in 3.5741710662842 seconds
Parsed 400000 codepoints in 3.844514131546 seconds
Parsed 500000 codepoints in 3.9653370380402 seconds
Parsed 600000 codepoints in 4.1616570949554 seconds
Parsed 700000 codepoints in 5.0790500640869 seconds
Parsed 800000 codepoints in 6.4049270153046 seconds
Parsed 900000 codepoints in 7.6122870445251 seconds
Parsed 1000000 codepoints in 7.8051130771637 seconds
}}}
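Breaking the cumulative timings above into per-100,000-codepoint deltas
makes the shape of the slowdown easier to see. A quick throwaway script
(the numbers are hardcoded copies of the two runs above, rounded to four
decimal places):
{{{#!php
<?php
// Cumulative seconds at each 100k-codepoint mark, copied from the runs above.
$codepoint_at_times = array( 0.0309, 0.0620, 0.0928, 0.1239, 0.1553, 0.1866, 0.2179, 0.2492, 0.2804, 0.3110 );
$scan_utf8_times    = array( 0.5044, 2.2214, 3.5742, 3.8445, 3.9653, 4.1617, 5.0791, 6.4049, 7.6123, 7.8051 );

// Convert cumulative times into the time spent on each 100k-codepoint chunk.
function per_interval( array $cumulative ): array {
	$deltas = array();
	$prev   = 0.0;
	foreach ( $cumulative as $t ) {
		$deltas[] = round( $t - $prev, 4 );
		$prev     = $t;
	}
	return $deltas;
}

echo 'utf8_codepoint_at: ', implode( ', ', per_interval( $codepoint_at_times ) ), "\n";
echo '_wp_scan_utf8:     ', implode( ', ', per_interval( $scan_utf8_times ) ), "\n";
printf( "overall ratio: %.1fx\n", end( $scan_utf8_times ) / end( $codepoint_at_times ) );
}}}
The `utf8_codepoint_at` deltas stay flat at ~0.031s per 100k codepoints,
while the `_wp_scan_utf8` deltas are both larger and erratic (swinging
between ~0.12s and ~1.72s), and the final cumulative ratio comes out to
about 25x.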
So there's roughly a 20–25x difference in performance. That sounds like a
big deal given the upcoming 6.9 release – any ideas where the slowness
might come from?
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63863#comment:51>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform