[wp-trac] [WordPress Trac] #62172: Deprecate non-UTF-8 Support
WordPress Trac
noreply at wordpress.org
Fri Jan 2 12:31:59 UTC 2026
#62172: Deprecate non-UTF-8 Support
-------------------------+-----------------------------
Reporter: dmsnell | Owner: (none)
Type: enhancement | Status: new
Priority: normal | Milestone: Future Release
Component: General | Version: 6.7
Severity: normal | Resolution:
Keywords: | Focuses:
-------------------------+-----------------------------
Comment (by dmsnell):
While working on #64427 I realized that the full set of text encodings we
are likely ever to //want// to support is the one given in the
[https://encoding.spec.whatwg.org/#replacement `WHATWG Encoding`
specification]. It’s possible that some WordPress installations interact
with content in other encodings, but this is the set a browser
will/should recognize and operate on.
There are 37 encodings, one slight variant, a fake encoding
(`x-user-defined`) that shifts non-US-ASCII bytes into the Private Use
Area, and a `replacement` encoding which //always fails// and produces an
empty string (mitigating security issues posed by legacy and tricky
encodings).
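For illustration, here is a minimal sketch of a decoder for that fake
encoding, following the WHATWG rules (US-ASCII bytes pass through; bytes
0x80–0xFF map to U+F780–U+F7FF in the Private Use Area). The function
name is hypothetical and this is not core code:
{{{#!php
<?php
// Hypothetical `x-user-defined` decoder per the WHATWG Encoding
// specification: US-ASCII bytes pass through unchanged, while
// bytes 0x80–0xFF land in the Private Use Area at U+F780. The
// high code points are emitted as hand-built three-byte UTF-8
// sequences so this sketch needs no extension support.
function decode_x_user_defined( string $bytes ): string {
	$output = '';
	for ( $i = 0, $length = strlen( $bytes ); $i < $length; $i++ ) {
		$byte = ord( $bytes[ $i ] );
		if ( $byte < 0x80 ) {
			$output .= $bytes[ $i ];
			continue;
		}
		$code_point = 0xF780 + ( $byte - 0x80 );
		$output    .= chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}
	return $output;
}
}}}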
On my own laptop, running PHP 8.5 with the `mbstring` and `intl`
extensions installed, most of these encodings are supported. Given the
list, I believe it’s feasible for us //to polyfill text conversion//
among these encodings, removing our dependence on the PHP extensions to
process them. Polyfills would be slow, but could follow the
native-by-default approach taken with UTF-8 support in WordPress 6.9, as
sketched below.
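A minimal sketch of what that dispatch could look like; the polyfill
function name is hypothetical, and a real implementation would also need
to map WHATWG labels onto the names `mbstring` recognizes:
{{{#!php
<?php
// Hypothetical native-by-default dispatch: use the fast
// extension-backed conversion when mbstring knows the encoding,
// and fall back to a slower pure-PHP polyfill otherwise.
function convert_to_utf8( string $bytes, string $encoding ): string {
	if (
		function_exists( 'mb_convert_encoding' ) &&
		in_array(
			strtolower( $encoding ),
			array_map( 'strtolower', mb_list_encodings() ),
			true
		)
	) {
		return mb_convert_encoding( $bytes, 'UTF-8', $encoding );
	}

	// `polyfill_convert_to_utf8()` is a hypothetical pure-PHP fallback.
	return polyfill_convert_to_utf8( $bytes, $encoding );
}
}}}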
In any potential future where we prune support for non-UTF-8, we might
have a plausible phase-out mechanism:
- Dynamically convert content from the database into UTF-8. It’s
generally [https://fluffyandflakey.blog/2024/09/18/dont-convert-html-text-encoding/ not safe to text-convert HTML], but probably safe enough to do
so for a support phase-out.
- Provide an option in Site Health to back up and migrate a site to
UTF-8.
- Provide a “dry run” check to see if a site can be safely migrated
without data loss (see the sketch after this list).
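For the dry-run idea, one plausible heuristic is a round-trip check: a
stored value can migrate without data loss if converting it to UTF-8 and
back reproduces the original bytes. A minimal sketch, assuming `mbstring`
can handle the source encoding:
{{{#!php
<?php
// Hypothetical dry-run check: conversion is lossless for this
// value if UTF-8 → original encoding round-trips byte-for-byte.
function can_migrate_losslessly( string $bytes, string $from_encoding ): bool {
	$as_utf8 = mb_convert_encoding( $bytes, 'UTF-8', $from_encoding );
	if ( false === $as_utf8 ) {
		return false;
	}
	$round_trip = mb_convert_encoding( $as_utf8, $from_encoding, 'UTF-8' );

	return $round_trip === $bytes;
}
}}}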
Further, we could start with simpler steps: ensure that WXR exports are
fully and universally UTF-8, performing the costlier but more reliable
syntax-aware HTML conversion highlighted in the linked blog post.
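As a first increment, the exporter could gate on validity and only
convert when needed; `convert_html_to_utf8()` below stands in for that
syntax-aware converter and is hypothetical:
{{{#!php
<?php
// Hypothetical export-time gate: emit a value verbatim when it is
// already well-formed UTF-8; otherwise run it through a
// syntax-aware HTML converter before writing it into the WXR file.
$safe_value = mb_check_encoding( $value, 'UTF-8' )
	? $value
	: convert_html_to_utf8( $value, get_option( 'blog_charset' ) );
}}}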
At one point I ran some analysis on the //declared// charset of HTML
sites on the web; however, I did not inspect the `Content-Type` header in
that analysis. I will attempt to rerun the analysis at some point to
better understand encodings at large. For each of the top N sites on the
Internet:
- Does the document contain non-US-ASCII?
- Does it parse as valid UTF-8?
- What is the set of declared encodings (a surprising number of websites
report multiple and conflicting encodings, like UTF-8 //and// UTF-16)?
- Does it parse as valid in each of the declared encodings?
- What does the HTML
[https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding Determining the Character Encoding]
algorithm report?
- Does it parse in the detected encoding?
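A rough sketch of how a few of those checks might be scripted in PHP (the
`<meta charset>` scan is deliberately naive; the real rules live in the
linked algorithm):
{{{#!php
<?php
// Does the document contain non-US-ASCII bytes?
$has_non_ascii = 1 === preg_match( '/[\x80-\xFF]/', $html );

// Does it parse as valid UTF-8?
$is_valid_utf8 = mb_check_encoding( $html, 'UTF-8' );

// Naive scan for declared encodings; a document may declare
// several, and they may conflict with each other or with the
// `Content-Type` header.
preg_match_all(
	'/<meta[^>]+charset\s*=\s*["\']?([^"\'\s\/>]+)/i',
	$html,
	$matches
);
$declared_encodings = array_unique( array_map( 'strtolower', $matches[1] ) );
}}}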
[[Image(https://github.com/user-attachments/assets/03be3aee-d9db-471e-81bc-bbadc7186c6a)]]
--
Ticket URL: <https://core.trac.wordpress.org/ticket/62172#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform