[wp-trac] [WordPress Trac] #63913: WordPress assumes that the UTF-8 PCRE flag is available.

WordPress Trac noreply at wordpress.org
Wed Sep 3 05:49:38 UTC 2025


#63913: WordPress assumes that the UTF-8 PCRE flag is available.
-------------------------+-----------------------------
 Reporter:  dmsnell      |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  low          |   Milestone:  Future Release
Component:  Charset      |     Version:  trunk
 Severity:  normal       |  Resolution:
 Keywords:               |     Focuses:
-------------------------+-----------------------------

Comment (by tusharbharti):

 Hi @dmsnell, thanks for the mention.
 For this snippet:

 {{{#!php
 <?php
 if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) {
         $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
 } else {
         $first    = mb_substr( $name, 0, 1, 'UTF-8' );
         $last     = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' );
         $initials = $first . $last;
 }
 }}}

 we could possibly use the `IntlChar` class to detect the script (via the
 Unicode block of the first character) and get the initials:
 {{{#!php
 <?php
 $firstChar = mb_substr( $name, 0, 1, 'UTF-8' );
 $codepoint = IntlChar::ord( $firstChar );
 $block     = IntlChar::getBlockCode( $codepoint );

 $cjkBlocks = array(
     IntlChar::BLOCK_CODE_CJK_UNIFIED_IDEOGRAPHS,
     IntlChar::BLOCK_CODE_HANGUL_SYLLABLES,
     IntlChar::BLOCK_CODE_HIRAGANA,
     IntlChar::BLOCK_CODE_KATAKANA,
 );

 if ( in_array( $block, $cjkBlocks, true ) ) {
     $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
 }
 }}}
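
 A minimal sketch of how the two checks could be combined, so that the
 `/u` pattern is only used when this PCRE build accepts it. The function
 names `example_name_has_cjk()` and `example_get_initials()` are
 hypothetical, not existing Core functions. It assumes `mbstring` is
 loaded, treats `IntlChar` (from the `intl` extension) as optional, and
 uses `mb_strrpos()` so the offset of the last space is counted in
 characters rather than bytes. The block list is narrower than the script
 properties in the regex, so this is only an approximation.
 {{{#!php
 <?php
 // Hypothetical helper: does the name look like it is written in a CJK script?
 function example_name_has_cjk( string $name ): bool {
     // Probe once per call: preg_match() returns false if the pattern cannot compile.
     if ( false !== @preg_match( '/\p{Han}/u', '' ) ) {
         return 1 === preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name );
     }

     // Fallback: inspect only the first character's Unicode block.
     if ( class_exists( 'IntlChar' ) ) {
         $first = mb_substr( $name, 0, 1, 'UTF-8' );
         $code  = '' === $first ? null : IntlChar::ord( $first );
         if ( null === $code ) {
             return false; // Empty or invalid input.
         }

         return in_array(
             IntlChar::getBlockCode( $code ),
             array(
                 IntlChar::BLOCK_CODE_CJK_UNIFIED_IDEOGRAPHS,
                 IntlChar::BLOCK_CODE_HANGUL_SYLLABLES,
                 IntlChar::BLOCK_CODE_HIRAGANA,
                 IntlChar::BLOCK_CODE_KATAKANA,
             ),
             true
         );
     }

     // Neither mechanism is available: assume a non-CJK name.
     return false;
 }

 // Hypothetical helper mirroring the branching in the snippet above.
 function example_get_initials( string $name ): string {
     if ( example_name_has_cjk( $name ) || false === strpos( $name, ' ' ) ) {
         return mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
     }

     $first = mb_substr( $name, 0, 1, 'UTF-8' );
     // Character offset of the last space, so multibyte names are handled.
     $last  = mb_substr( $name, mb_strrpos( $name, ' ', 0, 'UTF-8' ) + 1, 1, 'UTF-8' );

     return $first . $last;
 }
 }}}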

 > An incomplete PCRE pattern I used to find sources in the codebase using
 the /u flag follows. It would be better to build a search based off of a
 PHP parser, but finding a comprehensive list of places assuming the UTF-8
 flag is left as an exercise for future work on this ticket.
 >
 > {{{('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1}}}
 > This looks for string literals starting and ending with the same
 delimiter, followed by a set of PCRE modifiers including the u, terminated
 by the opening quote. It does not find HEREDOC or NOWDOC patterns and it
 does not find unmatched delimiters like parentheses or other brackets.

 Hmm, I will see if I can write the scanner, but in the meantime I can
 improve the regex.
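
 As a rough starting point for a parser-based search, here is a sketch
 built on PHP's tokenizer (`token_get_all()`). The function name
 `example_find_u_flag_patterns()` is hypothetical. It assumes the
 `tokenizer` extension is loaded, looks only at plain quoted string
 literals, and treats any literal shaped like
 "delimiter, body, closing delimiter, letters including u" as a candidate,
 so false positives are possible. Unlike the regex above it handles
 bracket-style delimiters, but it still skips HEREDOC, NOWDOC, and
 interpolated strings.
 {{{#!php
 <?php
 // Hypothetical scanner: list string literals that look like PCRE patterns using /u.
 function example_find_u_flag_patterns( string $source ): array {
     // Bracket-style delimiters close with their counterpart.
     $closers = array( '(' => ')', '[' => ']', '{' => '}', '<' => '>' );
     $found   = array();

     foreach ( token_get_all( $source ) as $token ) {
         // Single-character tokens are plain strings; only quoted literals matter here.
         if ( ! is_array( $token ) || T_CONSTANT_ENCAPSED_STRING !== $token[0] ) {
             continue;
         }

         // Drop the surrounding quotes; escape sequences are left as written.
         $body = (string) substr( $token[1], 1, -1 );
         if ( strlen( $body ) < 3 ) {
             continue;
         }

         // PCRE delimiters cannot be alphanumeric, whitespace, or a backslash.
         if ( 1 !== preg_match( '/^[^a-zA-Z0-9\s\\\\]/', $body ) ) {
             continue;
         }

         $open  = $body[0];
         $close = $closers[ $open ] ?? $open;
         $end   = strrpos( $body, $close );
         if ( false === $end || 0 === $end ) {
             continue; // No closing delimiter after the opening one.
         }

         // Everything after the closing delimiter must be modifier letters including "u".
         $modifiers = substr( $body, $end + 1 );
         if ( 1 === preg_match( '/^[a-zA-Z]*u[a-zA-Z]*$/', $modifiers ) ) {
             $found[] = array(
                 'line'    => $token[2],
                 'literal' => $token[1],
             );
         }
     }

     return $found;
 }
 }}}

 Each result carries the line number and the literal as written, which
 should be enough to review the candidates by hand.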

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63913#comment:1>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

