[wp-trac] [WordPress Trac] #62172: Deprecate non-UTF-8 Support
WordPress Trac
noreply at wordpress.org
Fri Jan 2 12:31:59 UTC 2026
#62172: Deprecate non-UTF-8 Support
-------------------------+-----------------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Future Release
Component: General | Version: 6.7
Severity: normal | Resolution:
Keywords: | Focuses:
-------------------------+-----------------------------
Comment (by dmsnell):
While working on #64427 I realized that the full set of text encodings we
are likely ever to //want// to support is the one given in the
[https://encoding.spec.whatwg.org/#replacement `WHATWG Encoding`
specification]. It’s possible that some WordPress installations interact
with content in other encodings, but this is the set a browser
will/should recognize and operate on.
There are 37 encodings, one slight variant, a fake encoding
(`x-user-defined`) that shifts non-US-ASCII bytes into the Private Use
Area, and a `replacement` encoding which //always fails// and produces an
empty string (mitigating security issues posed by legacy and tricky
encodings).
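For illustration, here is a minimal sketch of a decoder for that fake
encoding, following the WHATWG rules (US-ASCII bytes pass through; bytes
0x80–0xFF map to U+F780–U+F7FF in the Private Use Area). The function
name is hypothetical and this is not core code:
{{{#!php
<?php
// Hypothetical `x-user-defined` decoder per the WHATWG Encoding
// specification: US-ASCII bytes pass through unchanged, while
// bytes 0x80–0xFF land in the Private Use Area at U+F780. The
// high code points are emitted as hand-built three-byte UTF-8
// sequences so this sketch needs no extension support.
function decode_x_user_defined( string $bytes ): string {
	$output = '';
	for ( $i = 0, $length = strlen( $bytes ); $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );
		if ( $byte < 0x80 ) {
			$output .= $bytes[ $i ];
			continue;
		}
		$code_point = 0xF780 + ( $byte - 0x80 );
		$output    .= chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	return $output;
}
}}}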
On my own laptop, running PHP 8.5 with the `mbstring` and `intl`
extensions installed, most of these encodings are supported. Given the
list, I believe it’s feasible for us //to polyfill text conversion//
among these encodings, removing our dependence on the PHP extensions to
process them. Polyfills would be slow, but could follow the
native-by-default approach taken with UTF-8 support in WordPress 6.9, as
sketched below.
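A minimal sketch of what that dispatch could look like; the polyfill
function name is hypothetical, and a real implementation would also need
to map WHATWG labels onto the names `mbstring` recognizes:
{{{#!php
<?php
// Hypothetical native-by-default dispatch: use the fast
// extension-backed conversion when mbstring knows the encoding,
// and fall back to a slower pure-PHP polyfill otherwise.
function convert_to_utf8( string $bytes, string $encoding ): string {
	if (
		function_exists( 'mb_convert_encoding' ) &&
		in_array(
			strtolower( $encoding ),
			array_map( 'strtolower', mb_list_encodings() ),
			true
		)
	) {
		return mb_convert_encoding( $bytes, 'UTF-8', $encoding );
	}

	// `polyfill_convert_to_utf8()` is a hypothetical pure-PHP fallback.
	return polyfill_convert_to_utf8( $bytes, $encoding );
}
}}}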
In any potential future where we prune support for non-UTF-8, we might
have a plausible phase-out mechanism:
- Dynamically convert content from the database into UTF-8. It’s
generally [https://fluffyandflakey.blog/2024/09/18/dont-convert-html-text-encoding/ not safe to text-convert HTML], but probably safe enough to do
so for a support phase-out.
- Provide an option in Site Health to back up and migrate a site to
UTF-8.
- Provide a “dry run” check to see if a site can be safely migrated
without data loss (see the sketch after this list).
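For the dry-run idea, one plausible heuristic is a round-trip check: a
stored value can migrate without data loss if converting it to UTF-8 and
back reproduces the original bytes. A minimal sketch, assuming `mbstring`
can handle the source encoding:
{{{#!php
<?php
// Hypothetical dry-run check: conversion is lossless for this
// value if UTF-8 → original encoding round-trips byte-for-byte.
function can_migrate_losslessly( string $bytes, string $from_encoding ): bool {
	$as_utf8 = mb_convert_encoding( $bytes, 'UTF-8', $from_encoding );
	if ( false === $as_utf8 ) {
		return false;
	}
	$round_trip = mb_convert_encoding( $as_utf8, $from_encoding, 'UTF-8' );

	return $round_trip === $bytes;
}
}}}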
Further, we could start with simpler steps: ensure that WXR exports are
fully and universally UTF-8, performing the costlier but more reliable
syntax-aware HTML conversion highlighted in the linked blog post.
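As a first increment, the exporter could gate on validity and only
convert when needed; `convert_html_to_utf8()` below stands in for that
syntax-aware converter and is hypothetical:
{{{#!php
<?php
// Hypothetical export-time gate: emit a value verbatim when it is
// already well-formed UTF-8; otherwise run it through a
// syntax-aware HTML converter before writing it into the WXR file.
$safe_value = mb_check_encoding( $value, 'UTF-8' )
	? $value
	: convert_html_to_utf8( $value, get_option( 'blog_charset' ) );
}}}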
At one point I ran some analysis on the //declared// charset of HTML
sites on the web; however, I did not inspect the `Content-Type` header in
that analysis. I will attempt to rerun the analysis at some point to
better understand encodings at large. For each of the top N sites on the
Internet:
- Does the document contain non-US-ASCII?
- Does it parse as valid UTF-8?
- What is the set of declared encodings (a surprising number of websites
report multiple and conflicting encodings, like UTF-8 //and// UTF-16)?
- Does it parse as valid in each of the declared encodings?
- What does the HTML
[https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding Determining the Character Encoding]
algorithm report?
- Does it parse in the detected encoding?
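A rough sketch of how a few of those checks might be scripted in PHP (the
`<meta charset>` scan is deliberately naive; the real rules live in the
linked algorithm):
{{{#!php
<?php
// Does the document contain non-US-ASCII bytes?
$has_non_ascii = 1 === preg_match( '/[\x80-\xFF]/', $html );

// Does it parse as valid UTF-8?
$is_valid_utf8 = mb_check_encoding( $html, 'UTF-8' );

// Naive scan for declared encodings; a document may declare
// several, and they may conflict with each other or with the
// `Content-Type` header.
preg_match_all(
	'/<meta[^>]+charset\s*=\s*["\']?([^"\'\s\/>]+)/i',
	$html,
	$matches
);
$declared_encodings = array_unique( array_map( 'strtolower', $matches[1] ) );
}}}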
[[Image(https://github.com/user-attachments/assets/03be3aee-d9db-471e-81bc-bbadc7186c6a)]]
--
Ticket URL: <https://core.trac.wordpress.org/ticket/62172#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform