[wp-trac] [WordPress Trac] #63974: .mo file loaded as UTF-8 by default - non-standard and ignoring Content-Type headers
WordPress Trac
noreply at wordpress.org
Mon Sep 15 14:09:07 UTC 2025
#63974: .mo file loaded as UTF-8 by default - non-standard and ignoring Content-
Type headers
--------------------------+------------------------------
Reporter: kkmuffme | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: I18N | Version:
Severity: normal | Resolution:
Keywords: | Focuses:
--------------------------+------------------------------
Changes (by dmsnell):
* keywords: reporter-feedback =>
Comment:
Thanks for the link. I thought you were talking about HTTP headers. I
don’t know the `.mo` format that well, except that just last week I
happened to rewrite the parser locally to avoid the CPU cache thrashing it
currently causes when reading the strings.
> in general text/plain content files by default always have ANSI encoding
unless otherwise specified
Since `.mo` files contain headers I guess we can talk about this but, as I
alluded to in my earlier comment, files in general are just files and carry
no encoding metadata, headers, or content type. Regardless, most of what
I’ve seen across various software ecosystems is an assumption of UTF-8,
not of US-ASCII. At least, nothing other than UTF-8 has been the
assumption for over a decade.
One glaring break from this recommendation to use UTF-8 as the sole
encoding of interchange is the way Excel handles CSV files: it assumes the
file was encoded with whatever the default system encoding was, during the
90s or early 00s, on the platform Excel is running on. That’s another
story though.
> What do you mean?
This was a typo I have since corrected; I was asking if your report was in
the context of downloading files, where HTTP headers would be sent
alongside the file data.
> If no Content-Encoding header is specified, it should be treated as
ANSI. Since ANSI does not support multibyte characters, this means those
should be removed. This is how msgunfmt handles it.
I’d love to hear some varied opinions on this. At a minimum we could check
for an encoding indication, and we could check if the strings are valid
UTF-8. It looks like `gettext` started defaulting to producing UTF-8 in
mid-2023, which feels recent. Here is why I propose this instead:
- Most tools which are unaware of character encodings (which seems to be
most that I encounter) actually produce or assume UTF-8.
- If it validates as UTF-8, it’s highly improbable that it’s any other
encoding.
- UTF-8 is US-ASCII compatible, so treating something as UTF-8 that was
written with bytes 0x00–0x7F decodes into literally the same text.
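To make the last two points concrete, here is a minimal Python sketch (not WordPress code; `is_valid_utf8` is a hypothetical helper name). It shows that ASCII-only bytes decode to the same text whether treated as US-ASCII or UTF-8, and that a typical Latin-1 string with non-ASCII characters fails UTF-8 validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


ascii_bytes = b"Hello, world"
latin1_bytes = "Grüße".encode("latin-1")  # non-ASCII text in a legacy encoding
utf8_bytes = "Grüße".encode("utf-8")      # the same text encoded as UTF-8

# Bytes in 0x00-0x7F decode to the same text under either assumption.
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

print(same_text)
print(is_valid_utf8(ascii_bytes), is_valid_utf8(utf8_bytes), is_valid_utf8(latin1_bytes))
```

The check is a heuristic: short legacy-encoded byte sequences can occasionally happen to be valid UTF-8, which is why the claim above is "highly improbable" rather than "impossible".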
Also, as @swissspidy points out, there’s another complication, which is
that WordPress is largely encoding-agnostic. There’s a very good chance
that //if// we get a `.mo` with bytes that are not valid according to the
listed header, they may still be relevant. I am not a huge fan of this,
but it’s part of the legacy.
I think at a minimum though we could attempt to answer the questions:
- Do we read an explicit encoding?
- Do the bytes validate in that encoding?
- Are we able to convert that encoding into UTF-8?
That might be a nice enhancement.
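The three questions above could be sketched roughly as follows, in Python rather than WordPress’s PHP. Everything here is an assumption for illustration: `declared_charset` and `check_translation` are hypothetical names, and the sketch takes the header entry and one translated string as raw bytes instead of parsing a real `.mo` file.

```python
import codecs
import re
from typing import Optional


def declared_charset(header: bytes) -> Optional[str]:
    """Question 1: does the header entry name an explicit charset?"""
    match = re.search(rb"charset=([A-Za-z0-9_.:-]+)", header, re.IGNORECASE)
    return match.group(1).decode("ascii") if match else None


def check_translation(raw: bytes, header: bytes) -> dict:
    """Answer the three questions for one string (hypothetical helper)."""
    charset = declared_charset(header)
    result = {"declared": charset, "valid": False, "utf8": None}
    try:
        # Assumption for this sketch: fall back to UTF-8 when nothing is declared.
        codec = codecs.lookup(charset or "utf-8")
    except LookupError:
        return result  # the declared charset is unknown to this runtime
    try:
        text = raw.decode(codec.name)  # Question 2: do the bytes validate?
    except UnicodeDecodeError:
        return result
    result["valid"] = True
    result["utf8"] = text.encode("utf-8")  # Question 3: convert to UTF-8
    return result


header = b"Content-Type: text/plain; charset=ISO-8859-1\n"
raw = "Grüße".encode("latin-1")
print(check_translation(raw, header))
print(check_translation(raw, b"Content-Type: text/plain\n"))
```

One caveat with this approach: single-byte encodings such as ISO-8859-1 accept every possible byte sequence, so "validates in that encoding" only carries real information for stricter encodings like UTF-8.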
--
Ticket URL: <https://core.trac.wordpress.org/ticket/63974#comment:4>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform