[wp-trac] [WordPress Trac] #36610: Loss of multibyte category and tag names

WordPress Trac noreply at wordpress.org
Wed Apr 20 22:40:42 UTC 2016


#36610: Loss of multibyte category and tag names
--------------------------+-----------------------------
 Reporter:  cfinke        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Taxonomy      |    Version:  trunk
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
 Some multibyte category and tag names can be lost during creation.

 Example: create a category with the name `テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
 テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAAA`. It is 201 bytes long and will be
 truncated by `$wpdb->strip_invalid_text_for_column()` to 200 bytes (`テテテテテ
 テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAA`) before
 the category is created.

 However, the category name `AAAテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
 テテテテテテテテテテテテテテテテテテテテテテテ` is also 201 bytes, but when it is truncated to
 200 bytes, it splits a multibyte character. When
 `wp_check_invalid_utf8()` is then called, it truncates the string to zero
 bytes out of an abundance of caution, since the string ends with a byte
 sequence that is not valid UTF-8.
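
 The byte arithmetic behind the two examples can be checked with a short
 sketch. This is Python used purely for illustration, not WordPress code.
 The count of 66 `テ` characters is an assumption derived from the 201-byte
 figure above (each `テ` is 3 bytes in UTF-8), and `is_valid_utf8` is a
 hypothetical helper.

```python
# テ (U+30C6) encodes to 3 bytes in UTF-8, so 66 of them plus "AAA" is 201 bytes.
KATAKANA_TE = "テ"


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two category names from the report, as UTF-8 byte strings.
name_ascii_last = (KATAKANA_TE * 66 + "AAA").encode("utf-8")
name_ascii_first = ("AAA" + KATAKANA_TE * 66).encode("utf-8")

# A plain byte-level cut at 200 bytes, ignoring character boundaries.
cut_ascii_last = name_ascii_last[:200]
cut_ascii_first = name_ascii_first[:200]

print(len(name_ascii_last), len(name_ascii_first))  # 201 201
print(is_valid_utf8(cut_ascii_last))   # True: only a trailing "A" was dropped
print(is_valid_utf8(cut_ascii_first))  # False: the last テ lost its third byte
```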

 It's clear that the category creator was not submitting invalid UTF-8, and
 the true goal of `$wpdb->strip_invalid_text_for_column()` was to ensure
 that the text would fit in the DB column without auto-truncation by the DB
 engine, so the ideal behavior should be that the string is truncated to
 the longest possible length that remains valid and fits within the column.
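
 A minimal sketch of that ideal behavior, again in Python rather than
 WordPress code: cut at the byte limit, then back up past any UTF-8
 continuation bytes so the cut lands on a character boundary.
 `truncate_utf8` is a hypothetical name, and the sketch assumes the input
 is already valid UTF-8.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Return the longest prefix of `data` that fits in `max_bytes` bytes
    and does not end inside a multibyte character.

    Assumes `data` is valid UTF-8 and `max_bytes` is non-negative.
    """
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # data[end] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # character, so move back until the cut sits on a lead byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


name = ("AAA" + "テ" * 66).encode("utf-8")  # 201 bytes
shortened = truncate_utf8(name, 200)
print(len(shortened))  # 198: the partially cut テ is dropped whole
```

 With this approach the second example name would be stored as `AAA`
 followed by 65 `テ` characters instead of being emptied.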

 One way to get around this data loss would be a wrapper around
 `wp_check_invalid_utf8()`. If `wp_check_invalid_utf8()` fails, chop a
 single byte off the end of the string and check it again, up to the point
 where you have checked the string without the last five bytes. (The
 original UTF-8 design allowed characters of up to six bytes, but RFC 3629
 limits valid UTF-8 to four bytes per character, so removing at most three
 bytes should be enough in practice.) Or, fix
 `$wpdb->strip_invalid_text_for_column()` so that it doesn't truncate in
 the middle of a multibyte character.
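
 The wrapper idea can be sketched as follows. This is Python for
 illustration only: `check_invalid_utf8` is a stand-in that mimics the
 "return an empty string on invalid input" behavior described above, not
 the real `wp_check_invalid_utf8()`, and `check_utf8_with_trim` and its
 `max_trim` parameter are hypothetical names.

```python
def check_invalid_utf8(data: bytes) -> bytes:
    """Stand-in validator: return `data` if it is valid UTF-8, else b""."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return b""
    return data


def check_utf8_with_trim(data: bytes, max_trim: int = 5) -> bytes:
    """Retry validation after removing 0..max_trim trailing bytes.

    Returns the first candidate that validates, or b"" if none does.
    """
    # Never trim more bytes than the string actually has.
    for trimmed in range(min(max_trim, len(data)) + 1):
        candidate = data[: len(data) - trimmed]
        result = check_invalid_utf8(candidate)
        if result or not candidate:
            return result
    return b""


# The second example name after a byte-level cut at 200 bytes.
cut = ("AAA" + "テ" * 66).encode("utf-8")[:200]
print(len(check_invalid_utf8(cut)))    # 0: the whole name is discarded
print(len(check_utf8_with_trim(cut)))  # 198: only the broken tail is removed
```

 One caveat with this design: it also trims trailing bytes that were
 invalid in the original submission, so it cannot tell a split character
 apart from genuinely bad input at the end of the string.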

 There might be a solution lurking in `mb_strlen()`. If
 `wp_check_invalid_utf8()` returns an empty string, remove bytes from the
 end of the original string (up to 5 bytes) until `mb_strlen()` returns a
 smaller number, and then try `wp_check_invalid_utf8()` again.

 Configuration details: tested on WordPress trunk (4.5-RC1-37153) with PHP
 5.2.17.

 Here's my `wp_terms` structure:

 {{{
 CREATE TABLE `wp_terms` (
   `term_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
   `name` varchar(200) NOT NULL DEFAULT '',
   `slug` varchar(200) NOT NULL DEFAULT '',
   `term_group` bigint(10) NOT NULL DEFAULT '0',
   PRIMARY KEY (`term_id`),
   KEY `slug` (`slug`(191)),
   KEY `name` (`name`(191))
 ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
 }}}

 See #36393 for discussion of a similar (but now-fixed) bug.

--
Ticket URL: <https://core.trac.wordpress.org/ticket/36610>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

