[wp-trac] [WordPress Trac] #63974: .mo file loaded as UTF-8 by default - non-standard and ignoring Content-Type headers

Mon Sep 15 14:09:07 UTC 2025

#63974: .mo file loaded as UTF-8 by default - non-standard and ignoring Content-
Type headers
--------------------------+------------------------------
 Reporter:  kkmuffme      |       Owner:  (none)
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  Awaiting Review
Component:  I18N          |     Version:
 Severity:  normal        |  Resolution:
 Keywords:                |     Focuses:
--------------------------+------------------------------
Changes (by dmsnell):

 * keywords:  reporter-feedback =>

Comment:

 Thanks for the link. I thought you were talking about HTTP headers. I
 don’t know the `.mo` format that well, other than that just last week I
 happened to rewrite the parser locally to avoid the CPU cache thrashing
 it currently causes when reading the strings.

 > in general text/plain content files by default always have ANSI encoding
 unless otherwise specified

 Since `.mo` files contain headers I guess we can talk about this, but as
 alluded to in my earlier comment, files are just files and carry no
 encoding metadata, headers, or content type of their own. Regardless,
 most of what I’ve seen across various software ecosystems is an
 assumption of UTF-8, not of US-ASCII; at least, nothing other than UTF-8
 has been the assumption for over a decade.

 Also one glaring break from this recommendation to use UTF-8 as the sole
 encoding of interchange is the way Excel handles CSV files, in which case
 it assumes that the CSV file was encoded with whatever was the default
 system encoding on the platform Excel is running on during the 90s or
 early 00s. That’s another story though.

 > What do you mean?

 This was a typo I have since corrected; I was asking if your report was in
 the context of downloading files, where HTTP headers would be sent
 alongside the file data.

 > If no Content-Encoding header is specified, it should be treated as
 ANSI. Since ANSI does not support multibyte characters, this means those
 should be removed. This is how msgunfmt handles it.

 I’d love to hear some varied opinion on this. At a minimum we could check
 for an encoding indication, and we could check if the strings are valid as
 UTF-8. It looks like `gettext` started defaulting to UTF-8 production in
 mid-2023, which feels recent. Here is why I propose this instead:

  - Most tools which are unaware of character encodings (which seems to be
 most that I encounter) actually produce or assume UTF-8.
  - If it validates as UTF-8 it’s highly improbable that it’s any other
 encoding.
  - UTF-8 is US-ASCII compatible, so treating something as UTF-8 that was
 written with bytes 0x00–0x7F decodes into literally the same text.
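
 To illustrate the last two points, here is a minimal Python sketch (not
 WordPress code; `is_valid_utf8` is a name made up for this example). It
 relies only on Python's strict built-in UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Pure US-ASCII bytes (0x00-0x7F) decode to the same text either way.
ascii_bytes = b"Hello, world"
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# The same word in two encodings: only one of them validates as UTF-8.
latin1_bytes = "caf\u00e9".encode("latin-1")  # b"caf\xe9"
utf8_bytes = "caf\u00e9".encode("utf-8")      # b"caf\xc3\xa9"
latin1_is_utf8 = is_valid_utf8(latin1_bytes)  # False: 0xE9 lacks continuation bytes
utf8_is_utf8 = is_valid_utf8(utf8_bytes)      # True
```

 Legacy single-byte text with any non-ASCII characters tends to fail this
 check quickly, which is why passing it is a strong signal.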

 Also as @swissspidy points out there’s another complication, which is that
 WordPress is largely encoding-agnostic. There’s a very good chance that
 //if// we get a `.mo` with bytes that are not valid according to the
 listed header, they may still be relevant. I am not a huge fan of this,
 but it’s part of the legacy.

 I think at a minimum though we could attempt to answer the questions:
  - Do we read an explicit encoding?
  - Do the bytes validate in that encoding?
  - Are we able to convert that encoding into UTF-8?

 That might be a nice enhancement.
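
 As a rough illustration of those three questions, here is a Python sketch
 (not the WordPress parser). It assumes the GNU `.mo` layout: a magic
 number of 0x950412de, string tables located via offsets in the file
 header, and the metadata stored as the translation of the empty msgid.
 The function names are hypothetical, and bounds checking is omitted.

```python
import re
import struct
from typing import Optional

MO_MAGIC = 0x950412DE  # GNU .mo magic number, readable in either byte order


def read_mo_header_entry(data: bytes) -> Optional[bytes]:
    """Return the raw translation of the empty msgid (the metadata), if any."""
    if len(data) < 20:
        return None
    for endian in ("<", ">"):
        if struct.unpack_from(endian + "I", data, 0)[0] == MO_MAGIC:
            break
    else:
        return None  # not a .mo file in either byte order
    _rev, count, orig_off, trans_off = struct.unpack_from(endian + "4I", data, 4)
    if count == 0:
        return None
    # Originals are sorted, so the empty msgid is the first entry when present.
    orig_len, _orig_pos = struct.unpack_from(endian + "2I", data, orig_off)
    if orig_len != 0:
        return None
    trans_len, trans_pos = struct.unpack_from(endian + "2I", data, trans_off)
    return data[trans_pos:trans_pos + trans_len]


def declared_charset(header: bytes) -> Optional[str]:
    """Question 1: do we read an explicit encoding?"""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", header, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else None


def to_utf8(raw: bytes, charset: Optional[str]) -> Optional[bytes]:
    """Questions 2 and 3: do the bytes validate, and can we convert them?

    Falls back to UTF-8 when no charset is declared. Returns None when the
    charset is unknown or the bytes are invalid in it.
    """
    try:
        return raw.decode(charset or "utf-8").encode("utf-8")
    except (LookupError, UnicodeDecodeError):
        return None
```

 One caveat: some single-byte encodings, ISO-8859-1 for example, accept
 every byte value, so "validates in that encoding" says much less for them
 than it does for UTF-8.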

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63974#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform