[wp-trac] [WordPress Trac] #64151: Improve maintainability and robustness of sanitize_title_with_dashes()
WordPress Trac
noreply at wordpress.org
Sat Oct 25 18:30:41 UTC 2025
#64151: Improve maintainability and robustness of sanitize_title_with_dashes()
-------------------------+----------------------------
Reporter: westonruter | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Future Release
Component: Formatting | Version: 1.2
Severity: normal | Keywords: needs-patch
Focuses: |
-------------------------+----------------------------
This is a follow-up to #64089.
As discussed in
[https://github.com/WordPress/wordpress-develop/pull/10204 PR #10204],
the `sanitize_title_with_dashes()` function is difficult to maintain
because it contains many URL-encoded characters and numeric HTML entities.
I tried to improve the maintainability in
[https://github.com/WordPress/wordpress-develop/pull/10204/commits/ff2d2a730144328591c4f654c41e06ad8499a7f6 ff2d2a7],
but my approach was not as robust as it could have been, as
[https://github.com/WordPress/wordpress-develop/pull/10204/files#r2453809403 feedback]
from @dmsnell pointed out:
> I strongly discourage replacements that attempt to match normative
character references, or which mix UTF-8 characters and HTML character
references. these lead to strange edge cases and can easily lead to
situations where we cannot accomplish what should be allowable.
>
> to that end if we want to make these replacements I would encourage
backing up to the top of this function and replacing `strip_tags()` with a
run through the HTML API to extract the title as decoded plaintext. once
that’s done we can examine raw UTF-8 replacements and not have to concern
ourselves if someone wrote `&nbsp;` or `&#160;` or `&#xA0;` or
`&NonBreakingSpace;` — all of these decode into the same U+00A0 code point.
>
> If not wanting to reconsider this function more holistically, this can
still be decoded as `WP_HTML_Decoder::decode_text_node( $title )` before
making these replacements. They can be done rather swiftly with `strtr()`.
Further, since we are creating a static replacements array, we don’t have
to use a potentially-missing runtime function to generate them: we can use
Unicode string literals like `\u{2011}` for the patterns/matches.
>
> Also a quick side note: HTML’s named character references are case-
sensitive, so while I am guessing the use of `str_ireplace()` is to catch
variations like `&NBSP;`, if it actually does that it will transform
_plaintext_ content and not the placeholder for a no-break space.
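The decode-then-`strtr()` suggestion above could look roughly like the following. This is a minimal sketch, not the proposed patch: the function name `example_normalize_dashes_and_spaces()` and the choice of code points in the map are hypothetical, and `html_entity_decode()` stands in for `WP_HTML_Decoder::decode_text_node()` only so that the snippet runs outside WordPress.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress core.
 * Decodes character references first, then replaces raw UTF-8 code points.
 */
function example_normalize_dashes_and_spaces( string $title ): string {
	// Assumption: stand-in for WP_HTML_Decoder::decode_text_node( $title ),
	// used here only because that class is unavailable outside WordPress.
	$decoded = html_entity_decode( $title, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	// Static map written with Unicode string literals, so no runtime
	// function is needed to build the keys. The entries are illustrative.
	static $replacements = array(
		"\u{00A0}" => '-', // NO-BREAK SPACE
		"\u{2011}" => '-', // NON-BREAKING HYPHEN
		"\u{2013}" => '-', // EN DASH
		"\u{2014}" => '-', // EM DASH
	);

	// strtr() with an array replaces every key in a single pass.
	return strtr( $decoded, $replacements );
}
```

With this ordering, `a&nbsp;b`, `a&#160;b` and `a&#xA0;b` all go through the same U+00A0 entry in the map, while text such as `&NBSP;`, which is not a named character reference, is left as plaintext.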
See that entire comment thread as well as his
[https://github.com/WordPress/wordpress-develop/pull/10204#pullrequestreview-3368107324 review]:
> we recently had similar work in
[https://github.com/WordPress/wordpress-develop/pull/9103 PR #9103] (#62995).
>
> […] we could consider an approach similar to that taken over there,
which is to rely on a Unicode-supported PCRE to replace
[https://www.unicode.org/Public/17.0.0/ucd/PropList.txt everything with
the `Dash_Punctuation` character property], and also the
`Space_Separator`.
>
> {{{
> if ( _wp_can_use_pcre_u() ) {
>     $title = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );
> }
> }}}
>
> Over time I think it’s okay to be more and more restrictive on these,
but I hope we push more in the direction of finding ways to ensure the
titles and filenames more closely match the content they are associated
with.
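For experimenting with the property-based replacement outside WordPress, a self-contained sketch might look like this. Both function names are hypothetical: `example_can_use_pcre_u()` is only an assumed stand-in for `_wp_can_use_pcre_u()`, and the fallback of returning the title unchanged is an illustration rather than a proposal for what core should do.

```php
<?php
/**
 * Hypothetical stand-in for _wp_can_use_pcre_u(): reports whether this
 * PCRE build accepts the /u (UTF-8) modifier.
 */
function example_can_use_pcre_u(): bool {
	static $supported = null;
	if ( null === $supported ) {
		$supported = 1 === @preg_match( '/^./u', 'a' );
	}
	return $supported;
}

/**
 * Hypothetical helper: replaces every Dash_Punctuation (Pd) and
 * Space_Separator (Zs) character with an ASCII hyphen.
 */
function example_dashes_for_separators( string $title ): string {
	if ( ! example_can_use_pcre_u() ) {
		// Assumption: leave the title untouched without Unicode PCRE support.
		return $title;
	}

	$replaced = preg_replace( '~[\p{Pd}\p{Zs}]~u', '-', $title );

	// preg_replace() returns null on failure, for example when the
	// subject is not valid UTF-8, so fall back to the original string.
	return null === $replaced ? $title : $replaced;
}
```

Note that the ASCII space (U+0020) is itself a `Space_Separator`, so it is also turned into a hyphen, whereas a tab is not matched by either property.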
--
Ticket URL: <https://core.trac.wordpress.org/ticket/64151>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform