[wp-trac] [WordPress Trac] #27733: wpautop(): \s in regex destroys some UTF-8 characters

WordPress Trac noreply at wordpress.org
Tue Sep 23 21:56:42 UTC 2025


#27733: wpautop(): \s in regex destroys some UTF-8 characters
--------------------------------------------------+---------------------
 Reporter:  tenpura                               |       Owner:  (none)
     Type:  defect (bug)                          |      Status:  closed
 Priority:  normal                                |   Milestone:
Component:  Formatting                            |     Version:  0.71
 Severity:  major                                 |  Resolution:  fixed
 Keywords:  needs-patch needs-unit-tests wpautop  |     Focuses:
--------------------------------------------------+---------------------
Changes (by dmsnell):

 * resolution:  wontfix => fixed


Comment:

 Thanks for the investigation @miqrogroove

 In my testing I was unable to get
 `1 === preg_match( '/\s/', "one\xA0two" )`, regardless of my `LC_CTYPE`,
 `LC_ALL`, and other charset-related ENV values or `php.ini` settings. I
 think we are in agreement that the PCRE functions simply don’t apply any
 special, locale-dependent handling to that byte.

 (I //was// able to get it to match when adding the `u` flag, as long as I
 updated the bytes to `\xC2\xA0`, the proper UTF-8 encoding of
 `NO-BREAK SPACE`.)

 It would be nice to know for sure that the PCRE functions operate simply on
 bytes (regardless of encoding) //or// on UTF-8 when given the `u` flag.
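
 The checks described above can be reproduced with a small script. This is
 only a sketch: `probe_whitespace()` is a hypothetical helper name, not
 anything in WordPress, and the byte-mode result is printed without an
 expected value because it is the behavior under investigation.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): reports how a pattern
 * containing `\s` behaves against a given subject string.
 */
function probe_whitespace( string $pattern, string $subject ): string {
	$result = preg_match( $pattern, $subject );

	if ( false === $result ) {
		// preg_match() returns false on failure, e.g. malformed UTF-8 under the `u` flag.
		return PREG_BAD_UTF8_ERROR === preg_last_error() ? 'invalid UTF-8' : 'error';
	}

	return 1 === $result ? 'match' : 'no match';
}

// Byte mode against the lone 0xA0 byte. No expected output is given here,
// since this is the case being tested across locale settings.
echo probe_whitespace( '/\s/', "one\xA0two" ), "\n";

// UTF-8 mode with the two-byte encoding of U+00A0 NO-BREAK SPACE.
echo probe_whitespace( '/\s/u', "one\xC2\xA0two" ), "\n"; // match

// UTF-8 mode treats the lone 0xA0 byte as malformed input.
echo probe_whitespace( '/\s/u', "one\xA0two" ), "\n"; // invalid UTF-8
```

 Running the first call under different `LC_CTYPE` values, or after a call
 to `setlocale()`, is one way to check whether byte mode ever matches.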

 ----

 This makes me think that WordPress in all of its supported environments
 will not and cannot create this scenario. Does that sound right? If so, I
 believe that `wontfix` is still fine, or maybe `invalid` would be more
 appropriate.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/27733#comment:17>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

