[wp-trac] [WordPress Trac] #34631: Extra compat for mbstring: mb_strpos()

Mon Aug 25 23:19:42 UTC 2025

#34631: Extra compat for mbstring: mb_strpos()
-------------------------------------------------+-------------------------
 Reporter:  Cybr                                 |       Owner:  (none)
     Type:  enhancement                          |      Status:  new
 Priority:  normal                               |   Milestone:  Awaiting
                                                 |  Review
Component:  Charset                              |     Version:  4.4
 Severity:  normal                               |  Resolution:
 Keywords:  has-patch needs-testing 2nd-opinion  |     Focuses:
  needs-unit-tests                               |
-------------------------------------------------+-------------------------

Comment (by dmsnell):

 It’s been a long time since this polyfill was suggested, but I think this
 could be incorporated into #63863.

 From what I understand after a quick look, `mb_strpos()` is still
 performing a byte-for-byte search. The main difference compared to
 `strpos()` is that its offset and return units are code points instead of
 bytes.

 {{{#!php
 <?php
 php > $a = "Bücher"; // This uses U+00FC, the LATIN SMALL LETTER U WITH DIAERESIS
 php > $b = "Bu\u{0308}cher"; // This uses U+0308, the COMBINING DIAERESIS, which is canonically equivalent to $a
 php > var_dump( $a, $b, strpos( $a, "ch" ), mb_strpos( $a, "ch" ), strpos( $b, "ch" ), mb_strpos( $b, "ch" ) );
 string(7) "Bücher"
 string(8) "Bücher"
 int(3)
 int(2)
 int(4)
 int(3)
 }}}

 What this should mean is that we can build a polyfill based on bytes and
 avoid splitting the string into an array of every character.

 {{{#!php
 <?php
 function _mb_strpos( $haystack, $needle, $offset = 0, $encoding = null ) {
         // handle the args fully in the actual code
         if ( ! is_utf8_charset( $encoding ) ) {
                 return false;
         }

         $byte_offset = _mb_codepoint_span( $haystack, 0, $offset, $found_offset_codepoints );
         if ( $found_offset_codepoints !== $offset ) {
                 // start is after end of string
                 return false;
         }

         $match_at_byte = strpos( $haystack, $needle, $byte_offset );
         if ( false === $match_at_byte ) {
                 return false;
         }

         // count the code points between the search start and the match
         $codepoints_to_match = _wp_codepoint_count( $haystack, $byte_offset, $match_at_byte - $byte_offset );
         return $offset + $codepoints_to_match;
 }
 }}}

 Granted, the details are mostly omitted here, and this assumes we merge
 the proposal currently in
 [https://github.com/WordPress/wordpress-develop/pull/9498 WordPress/wordpress-develop#9498],
 but it should involve no memory overhead and be more performant than
 splitting and matching array elements. Plus, as a bonus, it should work
 normatively in the presence of invalid UTF-8.

 Ideally, we’d want the search to return the same value as if we had called
 `mb_strpos( mb_scrub( $haystack ), $needle )`.
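
 For illustration only, here is a minimal self-contained sketch of the
 byte-based idea that does not depend on the helpers from the proposal. All
 of the `sketch_*` names are hypothetical and are not part of WordPress or
 of the pull request. It assumes a valid UTF-8 haystack and a non-negative
 offset, so it does not attempt the `mb_scrub()`-equivalent behavior for
 invalid input described above.

 {{{#!php
 <?php
 // Hypothetical helper: counts code points in a byte range of valid UTF-8
 // by counting every byte that is not a continuation byte (0x80-0xBF).
 function sketch_codepoint_count( $text, $byte_start, $byte_length ) {
         $count = 0;
         $end   = min( strlen( $text ), $byte_start + $byte_length );
         for ( $i = $byte_start; $i < $end; $i++ ) {
                 if ( 0x80 !== ( ord( $text[ $i ] ) & 0xC0 ) ) {
                         $count++;
                 }
         }
         return $count;
 }

 // Hypothetical helper: returns the byte offset reached after skipping
 // $codepoints code points, or null if the string ends first.
 function sketch_codepoint_to_byte_offset( $text, $codepoints ) {
         $length = strlen( $text );
         $at     = 0;
         while ( $codepoints > 0 ) {
                 if ( $at >= $length ) {
                         return null;
                 }
                 // step over the leading byte, then any continuation bytes
                 $at++;
                 while ( $at < $length && 0x80 === ( ord( $text[ $at ] ) & 0xC0 ) ) {
                         $at++;
                 }
                 $codepoints--;
         }
         return $at;
 }

 // Sketch of a byte-based mb_strpos() for UTF-8 (non-negative offsets only).
 function sketch_mb_strpos( $haystack, $needle, $offset = 0 ) {
         $byte_offset = sketch_codepoint_to_byte_offset( $haystack, $offset );
         if ( null === $byte_offset ) {
                 // start is after end of string
                 return false;
         }

         $match_at_byte = strpos( $haystack, $needle, $byte_offset );
         if ( false === $match_at_byte ) {
                 return false;
         }

         return $offset + sketch_codepoint_count( $haystack, $byte_offset, $match_at_byte - $byte_offset );
 }
 }}}

 With the two strings from the earlier example, `sketch_mb_strpos( "Bücher",
 "ch" )` returns 2 and `sketch_mb_strpos( "Bu\u{0308}cher", "ch" )` returns
 3, matching the `mb_strpos()` results shown above. The haystack is only
 ever scanned in place, so no per-character array is allocated.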

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/34631#comment:5>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform