[wp-trac] [WordPress Trac] #64842: Upload problems with Umlauts in ID3 Tags

WordPress Trac noreply at wordpress.org
Thu Mar 19 22:17:02 UTC 2026


#64842: Upload problems with Umlauts in ID3 Tags
-------------------------------------+------------------------------
 Reporter:  claireschlamm            |       Owner:  (none)
     Type:  defect (bug)             |      Status:  new
 Priority:  normal                   |   Milestone:  Awaiting Review
Component:  Upload                   |     Version:  6.9.1
 Severity:  normal                   |  Resolution:
 Keywords:  has-patch needs-testing  |     Focuses:
-------------------------------------+------------------------------
Changes (by abhishekfdd):

 * keywords:   => has-patch needs-testing


Comment:

 I was able to reproduce this. Uploading the example MP3 via **Media > Add
 Media File** fails with "Could not insert attachment into database."
 However, uploading the same file inside a post using the Audio or File
 block succeeds.

 The difference comes from two separate code paths:

 - **Media Library upload** uses `media_handle_upload()` in
 `wp-admin/includes/media.php`.
 - **Block editor upload** uses the REST API (`/wp/v2/media`) via
 `WP_REST_Attachments_Controller`, which handles metadata differently.

 **Root cause:**

 ID3v1 tag values are, by convention, ISO-8859-1 encoded. German umlauts
 like `äöüÄÖÜß` are valid ISO-8859-1 characters, but their single-byte
 ISO-8859-1 encodings (for example `0xE4` for `ä`) are **not** valid UTF-8
 byte sequences.
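
 The mismatch is easy to see at the byte level. The following is a minimal
 Python sketch for illustration only (it is not WordPress code, and the
 sample value `Müller` and the helper `is_valid_utf8` are made up for the
 example):

```python
# Illustration only: "Müller" as an ISO-8859-1 encoded ID3v1 tag would store it.
raw_tag = "Müller".encode("iso-8859-1")
print(raw_tag)  # b'M\xfcller'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(raw_tag))                   # False
print(is_valid_utf8("Müller".encode("utf-8")))  # True
```

 The single byte `0xFC` is a legal ISO-8859-1 `ü`, but it can never start a
 UTF-8 sequence, so the whole value fails validation.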

 The `getID3` library (bundled in `wp-includes/ID3/`) is configured with
 `$encoding = 'UTF-8'` and should convert ID3v1 tags from ISO-8859-1 to
 UTF-8. However, in certain cases — particularly when files have both ID3v1
 and ID3v2 tags, or when tag editors write non-standard encodings — the
 conversion doesn't happen correctly.

 In `wp_add_id3_tag_data()`, these potentially invalid-UTF-8 tag values are
 passed through `wp_kses_post()`, which does not fix encoding issues. The
 values then flow into `media_handle_upload()`:

 1. `$meta['title']` is assigned directly to `$title` **without**
 `sanitize_text_field()` (the filename-based title gets
 `sanitize_text_field()`, but the ID3 title does not).
 2. `$title`, `$meta['album']`, `$meta['artist']`, and `$meta['genre']` are
 interpolated into `$content` via `sprintf()`.
 3. Both `post_title` and `post_content` are passed to
 `wp_insert_attachment()` → `wp_insert_post()`.
 4. The invalid UTF-8 is rejected at the database layer (likely by `wpdb`'s
 charset validation before the query reaches MySQL), and the insertion
 fails.

 **Patch:**

 I'm attaching `64842.3.diff`, which addresses this in three ways:

 1. **Introduces `_wp_id3_ensure_utf8()`** — a private helper in
 `media.php` that detects invalid UTF-8 and converts from Windows-1252 (a
 superset of ISO-8859-1 covering the ID3v1 spec encoding). This preserves
 the actual umlaut characters rather than stripping them.
 2. **Applies the conversion in `wp_add_id3_tag_data()`** — each tag value
 is passed through `_wp_id3_ensure_utf8()` before `wp_kses_post()`, fixing
 the encoding at the source.
 3. **Adds `sanitize_text_field()` on the ID3 title** in
 `media_handle_upload()` — currently the ID3-sourced title is assigned raw,
 unlike the filename-based fallback.
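
 The detect-then-convert idea behind the helper can be sketched as follows.
 This is a Python illustration of the logic only, not the code in the patch;
 the function name `ensure_utf8` is a hypothetical stand-in, and the fallback
 to replacement characters for unmapped bytes is an assumption of this
 sketch:

```python
def ensure_utf8(value: bytes) -> str:
    """Hypothetical stand-in for the helper: return text for a raw tag value."""
    try:
        # Already valid UTF-8 (this includes plain ASCII): keep it unchanged.
        return value.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: assume a Windows-1252 / ISO-8859-1 tag value.
        # Python's cp1252 codec has no mapping for 0x81, 0x8D, 0x8F, 0x90
        # and 0x9D, so those bytes become U+FFFD instead of raising.
        return value.decode("cp1252", errors="replace")


print(ensure_utf8(b"M\xfcller"))                # Müller
print(ensure_utf8("Müller".encode("utf-8")))    # Müller
print(ensure_utf8(b"Plain ASCII title"))        # Plain ASCII title
```

 Validity checking is a heuristic: a short Windows-1252 string can happen to
 form a valid UTF-8 sequence, in which case it is left untouched. For typical
 umlaut-bearing tag values this is rare, because a lone high byte followed by
 an ASCII letter is never valid UTF-8.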

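 I chose `mb_convert_encoding()` with `'Windows-1252'` as the source
 encoding over `'ISO-8859-1'` because the two agree on the printable range
 `0xA0–0xFF` (which contains the umlauts), while Windows-1252 also maps most
 of `0x80–0x9F` to printable characters such as curly quotes and `€`, where
 ISO-8859-1 has no printable characters. Many real-world tag editors write
 Windows-1252 in practice.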
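
 The practical difference only shows up for bytes in `0x80–0x9F`. A small
 Python illustration (not WordPress code; the sample bytes are invented for
 the example, and Python's `iso-8859-1` codec decodes that range to invisible
 C1 control characters):

```python
# Hypothetical tag value: curly quotes (0x93, 0x94) around "Grüße",
# as a Windows-based tag editor might write it.
raw = b"\x93Gr\xfc\xdfe\x94"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

print(as_cp1252)  # “Grüße”

# Both decodings agree on the umlaut and the sharp s ...
print(as_cp1252[1:-1] == as_latin1[1:-1])  # True

# ... but ISO-8859-1 turns the quote bytes into control characters.
print([hex(ord(ch)) for ch in as_latin1 if 0x80 <= ord(ch) <= 0x9F])  # ['0x93', '0x94']
```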

 **Testing:**

 1. Download the reporter's example file from
 `https://cba.media/wp-content/uploads/example_with_umlaut.mp3`
 2. Without patch: upload via Media > Add Media File → fails with "Could
 not insert attachment into database"
 3. With patch: upload succeeds; the attachment title and description
 preserve the German umlauts correctly
 4. Also verify that uploading the same file via the Audio/File block in
 the editor still works (no regression)
 5. Test with a file containing only ASCII ID3 tags to confirm no
 regression on normal uploads

 Note: The recent UTF-8 modernization work in #63863 (WordPress 6.9)
 improves `wp_check_invalid_utf8()` with replacement characters, but that
 function is designed for strings that are *nominally* UTF-8 with some bad
 bytes. Here the problem is that the entire string is in a *different
 encoding* (ISO-8859-1), so conversion is the correct approach rather than
 replacement.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/64842#comment:1>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

