[wp-trac] [WordPress Trac] #64151: Improve maintainability and robustness of sanitize_title_with_dashes()

Sat Oct 25 18:30:41 UTC 2025

#64151: Improve maintainability and robustness of sanitize_title_with_dashes()
-------------------------+----------------------------
 Reporter:  westonruter  |      Owner:  (none)
     Type:  enhancement  |     Status:  new
 Priority:  normal       |  Milestone:  Future Release
Component:  Formatting   |    Version:  1.2
 Severity:  normal       |   Keywords:  needs-patch
  Focuses:               |
-------------------------+----------------------------
 This is a follow-up to #64089.

 As discussed in [https://github.com/WordPress/wordpress-develop/pull/10204
 PR #10204], the `sanitize_title_with_dashes()` function is difficult to
 maintain because it contains many URL-encoded characters and numeric HTML
 entities.

 I tried to improve the maintainability in
 [https://github.com/WordPress/wordpress-develop/pull/10204/commits/ff2d2a730144328591c4f654c41e06ad8499a7f6 ff2d2a7],
 but that approach was not as robust as it could have been, as
 [https://github.com/WordPress/wordpress-develop/pull/10204/files#r2453809403 feedback]
 from @dmsnell points out:

 > I strongly discourage replacements that attempt to match normative
 character references, or which mix UTF-8 characters and HTML character
 references. these lead to strange edge cases and can easily lead to
 situations where we cannot accomplish what should be allowable.
 >
 > to that end if we want to make these replacements I would encourage
 backing up to the top of this function and replacing `strip_tags()` with a
 run through the HTML API to extract the title as decoded plaintext. once
 that’s done we can examine raw UTF-8 replacements and not have to concern
 ourselves if someone wrote ` ` or `&nbsp` or `&#xA0;` or
 `&#0000000160` — all of these decode into the same U+00A0 code point.
 >
 > If not wanting to reconsider this function more holistically, this can
 still be decoded as `WP_HTML_Decoder::decode_text_node( $title )` before
 making these replacements. They can be done rather swiftly with `strtr()`.
 Further, since we are creating a static replacements array, we don’t have
 to use a potentially-missing runtime function to generate them: we can use
 Unicode string literals like `\u{2011}` for the patterns/matches.
 >
 > Also a quick side note: HTML’s named character references are case-
 sensitive, so while I am guessing the use of `str_ireplace()` is to catch
 variations like `&NBSP;`, if it actually does that it will transform
 _plaintext_ content and not the placeholder for a no-break space.
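
 The decode-then-`strtr()` idea above could be sketched roughly as follows.
 This is only an illustration under stated assumptions, not a proposed patch:
 `example_normalize_title_chars()` is a hypothetical name, the replacement
 table is a small illustrative subset, and PHP's `html_entity_decode()`
 stands in for `WP_HTML_Decoder::decode_text_node()` so the snippet runs
 outside WordPress. The stand-in is stricter than an HTML decoder (for
 example, it does not decode `&nbsp` without the trailing semicolon).

```php
<?php
/**
 * Hypothetical helper (not part of WordPress core): decode character
 * references first, then replace raw UTF-8 characters in one pass.
 */
function example_normalize_title_chars( string $title ): string {
	// Assumption: inside WordPress this would be
	// WP_HTML_Decoder::decode_text_node( $title ). html_entity_decode()
	// is used here only so the sketch is self-contained.
	$decoded = html_entity_decode( $title, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	// Static table keyed by Unicode string literals, so no runtime
	// function is needed to build the patterns. Illustrative subset only.
	static $replacements = array(
		"\u{00A0}" => '-', // NO-BREAK SPACE
		"\u{2011}" => '-', // NON-BREAKING HYPHEN
		"\u{2013}" => '-', // EN DASH
		"\u{2014}" => '-', // EM DASH
	);

	// strtr() with an array is case-sensitive and does a single pass.
	return strtr( $decoded, $replacements );
}
```

 With this sketch, `a&nbsp;b`, `a&#xA0;b`, and a title containing a literal
 U+2014 all end up with a plain hyphen, while `&NBSP;` is left alone as
 plaintext because named character references are case-sensitive.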

 See that entire comment thread as well as his
 [https://github.com/WordPress/wordpress-develop/pull/10204#pullrequestreview-3368107324 review]:

 > we recently had similar work in
 [https://github.com/WordPress/wordpress-develop/pull/9103 PR #9103] (#62995).
 >
 > […] we could consider an approach similar to that taken over there,
 which is to rely on a Unicode-supported PCRE to replace
 [https://www.unicode.org/Public/17.0.0/ucd/PropList.txt everything with
 the `Dash_Punctuation` character property], and also the
 `Space_Separator`.
 >
 > {{{
 > if ( _wp_can_use_pcre_u() ) {
 >       $title = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );
 > }
 > }}}
 >
 > Over time I think it’s okay to be more and more restrictive on these,
 but I hope we push more in the direction of finding ways to ensure the
 titles and filenames more closely match the content they are associated
 with.
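
 For reference, a self-contained variant of the property-based replacement
 quoted above might look like the following. It is a sketch under
 assumptions: `example_can_use_pcre_u()` is a hypothetical stand-in for
 `_wp_can_use_pcre_u()` so the snippet runs outside WordPress, and
 `example_dashes_and_spaces_to_hyphens()` is likewise a made-up name. The
 fallback of returning the input unchanged is one possible choice, not
 something the review prescribes.

```php
<?php
/**
 * Hypothetical stand-in for _wp_can_use_pcre_u(): reports whether the
 * PCRE build accepts the /u (UTF-8) modifier.
 */
function example_can_use_pcre_u(): bool {
	static $supported = null;
	if ( null === $supported ) {
		// preg_match() returns false (with a warning) if /u is unsupported.
		$supported = ( 1 === @preg_match( '/^./u', 'a' ) );
	}
	return $supported;
}

/**
 * Hypothetical helper: replace every Dash_Punctuation (Pd) and
 * Space_Separator (Zs) character with an ASCII hyphen.
 */
function example_dashes_and_spaces_to_hyphens( string $title ): string {
	if ( ! example_can_use_pcre_u() ) {
		return $title; // Assumed fallback: leave the title untouched.
	}

	$result = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );

	// preg_replace() returns null on error, e.g. for invalid UTF-8 input.
	return null === $result ? $title : $result;
}
```

 Each matching character is replaced individually, so a run such as a space,
 an em dash, and another space becomes three consecutive hyphens; collapsing
 such runs would be a separate step.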

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/64151>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform