[wp-trac] [WordPress Trac] #63913: WordPress assumes that the UTF-8 PCRE flag is available.

Tue Sep 2 23:21:29 UTC 2025

#63913: WordPress assumes that the UTF-8 PCRE flag is available.
-------------------------+----------------------------
 Reporter:  dmsnell      |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  low          |  Milestone:  Future Release
Component:  Charset      |    Version:  trunk
 Severity:  normal       |   Keywords:
  Focuses:               |
-------------------------+----------------------------
 There are several places in Core where code assumes that the UTF-8 flag
 (`PCRE_UTF8`) is available and performs matching or substitution with no
 alternative. When this happens, the `preg_` functions fail silently. Data
 corruption ensues in ways which are hard to track to this source.

 The `_wp_can_use_pcre_u()` function indicates support for this flag, but
 resolving this issue is not as simple as wrapping the `preg_` calls in a
 check. The existing code provides no fallback mechanism, and it may be
 necessary to define one for environments where the flag is unsupported.

 One such example is `shortcode_parse_atts()`.

 {{{#!php
 <?php
 $text    = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text );
 }}}

 In this case, when `PCRE_UTF8` is not supported, `$text` becomes `NULL`
 and the function fails to parse any shortcode attributes. This is an
 example of a function which doesn’t require UTF-8 regex patterns, because
 the above substitution is trivial to perform without them.
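
 As a minimal sketch of one such substitution (an illustration, not a
 proposed patch): U+00A0 and U+200B encode in UTF-8 as the bytes `C2 A0`
 and `E2 80 8B`, so a byte-oriented pattern can match them without the `u`
 flag. The sample input is invented for the example.

 {{{#!php
 <?php
 // Sample input: a no-break space followed by a zero-width space.
 $text = "a\xC2\xA0\xE2\x80\x8Bb";

 // No `u` flag: PCRE matches the raw UTF-8 byte sequences instead.
 $text = preg_replace( '/(?:\xC2\xA0|\xE2\x80\x8B)+/', ' ', $text );

 var_dump( 'a b' === $text ); // bool(true)
 }}}

 Because no UTF-8 validation takes place without the flag, this form
 should also avoid the `NULL` return when `$text` contains invalid UTF-8.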

 A more complicated example is `get_avatar_data()`, which matches on the
 script properties of characters, for which there is no simple
 replacement. Without support for the flag, it may not be possible to
 provide an equivalent fallback.

 {{{#!php
 <?php
 if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) {
         $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
 } else {
         $first    = mb_substr( $name, 0, 1, 'UTF-8' );
         $last     = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' );
         $initials = $first . $last;
 }
 }}}
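
 To illustrate why an equivalent fallback is hard, here is a rough sketch
 of an approximation which checks a few Unicode blocks by code point. The
 function name `example_has_common_cjk()` and the chosen ranges are
 assumptions for this example. Block boundaries do not line up exactly
 with the script properties, and `\p{Han}` in particular covers many more
 ranges than the single block listed, so this is not a drop-in replacement.

 {{{#!php
 <?php
 // Hypothetical helper, not part of Core: approximate the script test
 // without the `u` flag by comparing code points against a few blocks.
 function example_has_common_cjk( $name ) {
         $ranges = array(
                 array( 0x3040, 0x309F ), // Hiragana block.
                 array( 0x30A0, 0x30FF ), // Katakana block.
                 array( 0x4E00, 0x9FFF ), // CJK Unified Ideographs block.
                 array( 0xAC00, 0xD7AF ), // Hangul Syllables block.
         );

         $length = mb_strlen( $name, 'UTF-8' );
         for ( $i = 0; $i < $length; $i++ ) {
                 $char       = mb_substr( $name, $i, 1, 'UTF-8' );
                 $code_point = mb_ord( $char, 'UTF-8' );
                 if ( false === $code_point ) {
                         continue; // Skip anything that cannot be decoded.
                 }
                 foreach ( $ranges as $range ) {
                         if ( $code_point >= $range[0] && $code_point <= $range[1] ) {
                                 return true;
                         }
                 }
         }

         return false;
 }

 var_dump( example_has_common_cjk( "\u{5C71}\u{7530} \u{592A}\u{90CE}" ) );
 var_dump( example_has_common_cjk( 'Jane Doe' ) );
 }}}

 This relies on `mb_ord()`, which requires PHP 7.2 or newer with the
 `mbstring` extension.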

 An incomplete PCRE pattern I used to find sources in the codebase using
 the `/u` flag follows. It would be better to build a search based off of a
 PHP parser, but finding a comprehensive list of places assuming the UTF-8
 flag is left as an exercise for future work on this ticket.

 {{{
 ('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1
 }}}

 This looks for string literals containing a pattern which starts and ends
 with the same delimiter, followed by a set of PCRE modifiers including
 `u`, and terminated by the same quote which opened the literal. It does
 not find `HEREDOC` or `NOWDOC` patterns and it does not find patterns
 whose closing delimiter differs from the opening one, like parentheses or
 other brackets.
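
 A sketch of how the pattern might be run from PHP itself, one source line
 at a time. The `~` delimiters and the helper name
 `example_find_u_patterns()` are assumptions made for this example, and
 the same limitations apply.

 {{{#!php
 <?php
 // The search pattern from this ticket, wrapped in `~` delimiters.
 $finder = '~(\'|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1~';

 // Hypothetical helper: list string literals in $source which look like
 // PCRE patterns carrying the `u` modifier, with their line numbers.
 function example_find_u_patterns( $finder, $source ) {
         $found = array();
         foreach ( explode( "\n", $source ) as $index => $line ) {
                 if ( preg_match_all( $finder, $line, $matches ) ) {
                         foreach ( $matches[0] as $literal ) {
                                 $found[] = array(
                                         'line'    => $index + 1,
                                         'literal' => $literal,
                                 );
                         }
                 }
         }
         return $found;
 }

 // The line quoted above from `shortcode_parse_atts()`, then a line
 // whose pattern does not use the `u` modifier.
 $sample = '$text = preg_replace( "/[\x{00a0}\x{200b}]+/u", \' \', $text );'
         . "\n" . '$ok = preg_match( "/abc/i", $text );';

 print_r( example_find_u_patterns( $finder, $sample ) );
 }}}

 To scan a real file, the `$sample` string could be swapped for the result
 of `file_get_contents()`.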

 h/t @tusharbharti

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63913>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform