[wp-trac] [WordPress Trac] #63913: WordPress assumes that the UTF-8 PCRE flag is available.
WordPress Trac
noreply at wordpress.org
Tue Sep 2 23:21:29 UTC 2025
#63913: WordPress assumes that the UTF-8 PCRE flag is available.
-------------------------+----------------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: low | Milestone: Future Release
Component: Charset | Version: trunk
Severity: normal | Keywords:
Focuses: |
-------------------------+----------------------------
There are several places in Core where code assumes that the UTF-8 flag
(`PCRE_UTF8`) is available and performs matching or substitution with no
alternative. When the flag is unavailable, the `preg_` functions fail
silently, and the resulting data corruption is hard to trace back to this
source.
The `_wp_can_use_pcre_u()` function indicates support for this flag, but
resolving this issue is not as simple as wrapping the `preg_` calls in a
check. The existing code provides no fallback mechanism, so one may need
to be defined for environments that lack support for the flag.
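As a rough illustration of the check and of a failure-aware call, here is a
minimal sketch. It is not Core's implementation: both function names are
hypothetical, and it assumes that running a trivial `/u` pattern returns
`false` on builds without UTF-8 support.

{{{#!php
<?php
/**
 * Hypothetical helper (not a Core function): reports whether this PCRE
 * build accepts the `u` modifier by running a trivial pattern once.
 */
function example_can_use_pcre_u() {
	static $supported = null;

	if ( null === $supported ) {
		// preg_match() returns false on failure. The warning is suppressed
		// because failure is the expected outcome on unsupported builds.
		$supported = ( 1 === @preg_match( '/^./u', 'a' ) );
	}

	return $supported;
}

/**
 * Hypothetical wrapper: returns the original text instead of NULL when
 * preg_replace() fails, so a failure cannot silently erase the input.
 */
function example_preg_replace_or_keep( $pattern, $replacement, $text ) {
	$result = @preg_replace( $pattern, $replacement, $text );

	return null === $result ? $text : $result;
}
}}}

Returning the unmodified input is only one possible policy. It avoids data
loss but skips the intended transformation, which is why a real fallback
may still be needed.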
One example of affected code is `shortcode_parse_atts()`.
{{{#!php
<?php
$text = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text );
}}}
In this case, when `PCRE_UTF8` is not supported, `$text` becomes `NULL`
and the function fails to parse any shortcode attributes. This is an
example of a function which doesn’t require UTF-8 regex patterns, because
the replacement above is trivial to perform without the flag.
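One way to do this without the `u` flag is to match the UTF-8 byte
sequences directly: U+00A0 encodes as the bytes `C2 A0` and U+200B as
`E2 80 8B`. The sketch below assumes `$text` is valid UTF-8. The function
name is hypothetical and this is not a proposed patch.

{{{#!php
<?php
/**
 * Hypothetical helper: collapses runs of no-break spaces (U+00A0) and
 * zero-width spaces (U+200B) into a single space without the `u` modifier.
 */
function example_normalize_shortcode_spaces( $text ) {
	// Without `u`, PCRE matches bytes, and `\xHH` names a single byte.
	$result = preg_replace( '/(?:\xC2\xA0|\xE2\x80\x8B)+/', ' ', $text );

	// Keep the input if the replacement fails for any reason.
	return null === $result ? $text : $result;
}
}}}

In valid UTF-8 a lead byte never occurs inside another character's
encoding, so these byte sequences can only match whole characters.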
A more complicated example is `get_avatar_data()`, which matches on the
Unicode script properties of characters, something with no simple
replacement. Without support for the flag, it may not be possible to
provide an equivalent fallback.
{{{#!php
<?php
if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) {
	$initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
} else {
	$first    = mb_substr( $name, 0, 1, 'UTF-8' );
	$last     = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' );
	$initials = $first . $last;
}
}}}
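Even without an equivalent fallback, a caller can tell a failed match
apart from a non-match, because `preg_match()` returns `false` on failure
and `0` when nothing matched. A minimal sketch, with a hypothetical helper
name. What the caller should do in the `null` case is left open.

{{{#!php
<?php
/**
 * Hypothetical helper: true if $name contains a character from one of the
 * listed scripts, false if it does not, null if the check could not run.
 */
function example_name_has_cjk_script( $name ) {
	$matched = @preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name );

	if ( false === $matched ) {
		// The pattern failed to run, so the answer is unknown, not "no".
		return null;
	}

	return 1 === $matched;
}
}}}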
Below is an incomplete PCRE pattern I used to find places in the codebase
that use the `/u` flag. It would be better to build a search based on a
PHP parser, but finding a comprehensive list of places assuming the UTF-8
flag is left as an exercise for future work on this ticket.
{{{
('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1
}}}
This looks for quoted string literals whose contents start and end with
the same PCRE delimiter, followed by a set of PCRE modifiers including
`u`, and terminated by the same quote character that opened the literal.
It does not find `HEREDOC` or `NOWDOC` patterns, and it does not find
patterns using paired delimiters, such as parentheses or other brackets,
where the closing delimiter differs from the opening one.
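To show how the pattern behaves, the sketch below applies it to PHP source
text one line at a time. The helper name and the `~` delimiter wrapped
around the pattern are assumptions made for this illustration.

{{{#!php
<?php
/**
 * Hypothetical helper: returns the quoted pattern literals in $source that
 * the search pattern flags as using the `u` modifier.
 */
function example_find_u_flag_literals( $source ) {
	// Nowdoc keeps the quotes and backslashes in the search pattern intact.
	$search = <<<'REGEX'
~('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1~
REGEX;

	$found = array();

	// Scan line by line so that a match never spans a line break.
	foreach ( preg_split( '/\R/', $source ) as $line ) {
		if ( preg_match_all( $search, $line, $matches ) ) {
			$found = array_merge( $found, $matches[0] );
		}
	}

	return $found;
}
}}}

Given a line such as `preg_split( '#\s+#su', $s )`, this returns the
literal `'#\s+#su'`, while a pattern like `'/^[a-z]+$/i'` is skipped
because its modifiers do not include `u`.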
h/t @tusharbharti
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63913>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform