[wp-trac] [WordPress Trac] #34631: Extra compat for mbstring: mb_strpos()
WordPress Trac
noreply at wordpress.org
Mon Aug 25 23:19:42 UTC 2025
#34631: Extra compat for mbstring: mb_strpos()
-------------------------------------------------+-------------------------
Reporter: Cybr | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Charset | Version: 4.4
Severity: normal | Resolution:
Keywords: has-patch needs-testing 2nd-opinion needs-unit-tests | Focuses:
-------------------------------------------------+-------------------------
Comment (by dmsnell):
It’s been a long time since this polyfill was suggested, but I think this
could be incorporated into #63863.
From what I understand after a quick look, `mb_strpos()` still performs a
byte-for-byte search. The main difference from `strpos()` is that its
offset argument and its return value are measured in code points instead
of bytes.
{{{#!php
<?php
php > $a = "Bücher"; // This uses U+00FC, the LATIN SMALL LETTER U WITH DIAERESIS
php > $b = "Bu\u{0308}cher"; // This uses U+0308, the COMBINING DIAERESIS, which is canonically equivalent to $a
php > var_dump( $a, $b, strpos( $a, "ch" ), mb_strpos( $a, "ch" ), strpos( $b, "ch" ), mb_strpos( $b, "ch" ) );
string(7) "Bücher"
string(8) "Bücher"
int(3)
int(2)
int(4)
int(3)
}}}
What this should mean is that we can build a polyfill based on bytes and
avoid splitting the string into an array of every character.
{{{#!php
<?php
function _mb_strpos( $haystack, $needle, $offset = 0, $encoding = null ) {
	// handle the args fully in the actual code
	if ( ! is_utf8_charset( $encoding ) ) {
		return false;
	}

	// Convert the code point offset into a byte offset.
	$byte_offset = _mb_codepoint_span( $haystack, 0, $offset, $found_offset_codepoints );
	if ( $found_offset_codepoints !== $offset ) {
		// start is after end of string
		return false;
	}

	// Search on bytes, exactly as strpos() does.
	$match_at_byte = strpos( $haystack, $needle, $byte_offset );
	if ( false === $match_at_byte ) {
		return false;
	}

	// Convert the byte distance from the start offset to the match back into code points.
	$codepoints_to_match = _wp_codepoint_count( $haystack, $byte_offset, $match_at_byte - $byte_offset );

	return $offset + $codepoints_to_match;
}
}}}
Granted, the details are mostly omitted here, and this assumes we merge the
proposal currently in
[https://github.com/WordPress/wordpress-develop/pull/9498 WordPress/wordpress-develop#9498],
but it should involve no memory overhead and be more performant than
splitting and matching array elements. Plus, as a bonus, it should work
normatively in the presence of invalid UTF-8.
Ideally, we’d want the search to return the same value as if we had called
`mb_strpos( mb_scrub( $haystack ), $needle )`.
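
To make the byte-based idea concrete without depending on the unmerged PR, here is a minimal self-contained sketch. The `sketch_*` names are hypothetical stand-ins invented for illustration; they are not the helpers from the PR, and their signatures are assumptions. The sketch assumes a valid UTF-8 haystack and a non-negative offset, and it leaves out encoding checks and any scrubbing of invalid bytes.

```php
<?php
/**
 * Hypothetical helper: count code points in a byte range by counting every
 * byte that is not a UTF-8 continuation byte (10xxxxxx).
 * Assumes the range holds valid UTF-8.
 */
function sketch_codepoint_count( string $text, int $byte_offset, int $byte_length ): int {
	$count = 0;
	$end   = min( strlen( $text ), $byte_offset + $byte_length );
	for ( $i = $byte_offset; $i < $end; $i++ ) {
		if ( ( ord( $text[ $i ] ) & 0xC0 ) !== 0x80 ) {
			$count++;
		}
	}
	return $count;
}

/**
 * Hypothetical helper: return the number of bytes spanned by up to
 * $max_codepoints code points starting at $byte_offset, and report in
 * $found how many code points were actually available.
 */
function sketch_codepoint_span( string $text, int $byte_offset, int $max_codepoints, &$found = null ): int {
	$found = 0;
	$at    = $byte_offset;
	$end   = strlen( $text );
	while ( $at < $end && $found < $max_codepoints ) {
		$at++; // Consume the lead byte.
		while ( $at < $end && ( ord( $text[ $at ] ) & 0xC0 ) === 0x80 ) {
			$at++; // Consume its continuation bytes.
		}
		$found++;
	}
	return $at - $byte_offset;
}

/**
 * Sketch of a byte-based mb_strpos() for UTF-8: no array of characters is
 * built, only offsets are converted between code points and bytes.
 */
function sketch_mb_strpos( string $haystack, string $needle, int $offset = 0 ) {
	if ( $offset < 0 ) {
		return false; // Negative offsets are out of scope for this sketch.
	}

	// Code point offset -> byte offset.
	$byte_offset = sketch_codepoint_span( $haystack, 0, $offset, $found );
	if ( $found !== $offset ) {
		return false; // The requested start lies past the end of the string.
	}

	// Plain byte search.
	$match_at_byte = strpos( $haystack, $needle, $byte_offset );
	if ( false === $match_at_byte ) {
		return false;
	}

	// Byte distance to the match -> code point distance.
	return $offset + sketch_codepoint_count( $haystack, $byte_offset, $match_at_byte - $byte_offset );
}
```

With the two strings from the session above, `sketch_mb_strpos( "B\u{00FC}cher", 'ch' )` gives 2 and `sketch_mb_strpos( "Bu\u{0308}cher", 'ch' )` gives 3, matching the `mb_strpos()` results shown there. Matching the `mb_scrub()` behavior on invalid input would need the helpers to treat each invalid byte sequence the way the scrubbed string would count it, which this sketch does not attempt.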
--
Ticket URL: <https://core.trac.wordpress.org/ticket/34631#comment:5>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform