[wp-trac] [WordPress Trac] #27733: wpautop(): \s in regex destroys some UTF-8 characters

WordPress Trac noreply at wordpress.org
Mon Sep 22 22:03:11 UTC 2025


#27733: wpautop(): \s in regex destroys some UTF-8 characters
--------------------------------------------------+----------------------
 Reporter:  tenpura                               |       Owner:  (none)
     Type:  defect (bug)                          |      Status:  closed
 Priority:  normal                                |   Milestone:
Component:  Formatting                            |     Version:  0.71
 Severity:  major                                 |  Resolution:  wontfix
 Keywords:  needs-patch needs-unit-tests wpautop  |     Focuses:
--------------------------------------------------+----------------------
Changes (by dmsnell):

 * status:  new => closed
 * resolution:   => wontfix


Comment:

 the problem here is probably not really that we have `\s` but rather that
 we’re mixing encodings, right?

 on a system whose internal encoding is something like `latin1` we may get
 U+00A0 encoded as 0xA0, which is the byte the PCRE pattern will treat as
 a no-break space and match with `\s`. Adding an arbitrarily limited set of
 space characters //appears// to resolve this problem because that
 particular offending byte is no longer caught, but there are a thousand
 other places where different bytes will trip things up.

 on systems with UTF-8 as their internal encoding, however, the no-break
 space will be encoded as 0xC2 0xA0 and the PCRE pattern will look for
 that two-byte sequence. it won’t mangle the `ム`, whose UTF-8 encoding
 (0xE3 0x83 0xA0) merely ends in the byte 0xA0.
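
 to make the byte-level picture concrete, here is a minimal sketch. it is
 written in Python rather than PHP and does not call PCRE at all; the
 helper names (`strip_byte_whitespace`, `strip_text_whitespace`,
 `is_valid_utf8`) are hypothetical, and the byte-oriented character class
 is only an assumed stand-in for a matcher that counts 0xA0 as whitespace.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE
MU = "\u30e0"    # KATAKANA LETTER MU, the character from the ticket

nbsp_latin1 = NBSP.encode("latin-1")  # single byte 0xA0
nbsp_utf8 = NBSP.encode("utf-8")      # two bytes 0xC2 0xA0
mu_utf8 = MU.encode("utf-8")          # three bytes 0xE3 0x83 0xA0


def strip_byte_whitespace(data: bytes) -> bytes:
    """Assumed stand-in for a byte-oriented matcher that treats 0xA0 as space."""
    return re.sub(rb"[ \t\r\n\xa0]+", b" ", data)


def strip_text_whitespace(text: str) -> str:
    """Character-oriented matching on decoded text; NBSP is matched as a whole."""
    return re.sub(r"\s+", " ", text)


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes still decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Byte-wise matching eats the trailing 0xA0 of the katakana and breaks it.
mangled = strip_byte_whitespace(mu_utf8)
print(mangled, is_valid_utf8(mangled))

# Character-wise matching replaces the NBSP and leaves the katakana alone.
print(strip_text_whitespace("a" + NBSP + MU))
```

 the byte-wise helper turns a valid three-byte character into an invalid
 sequence, while the character-wise helper only touches the real no-break
 space. that is the same distinction as matching against mixed encodings
 versus matching against text whose encoding is known.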

 given that UTF-8 is the default internal encoding in PHP and has been for
 years, I’m inclined to close this as it shouldn’t practically be an issue
 any more. if we wanted to resolve it fully we’d have to check every place
 we call string-related functions for which encoding is going in and which
 is set as the default. that is not a feasible task.

 for that reason I think it would fit nicely as a duplicate of #62172. if
 we acknowledge that UTF-8 is the only actual supported encoding, this bug
 cannot appear. it’s really the obligation of whoever is integrating the
 database, server, PHP code, and plugins to ensure proper harmony between
 the various text encodings.

 going to mark as `wontfix` for now. feel free to re-open if you disagree.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/27733#comment:13>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

