[wp-trac] [WordPress Trac] #36393: Loss of multibyte comment author names

WordPress Trac noreply at wordpress.org
Fri Apr 1 07:04:35 UTC 2016


#36393: Loss of multibyte comment author names
--------------------------+-----------------------------
 Reporter:  cfinke        |      Owner:
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Comments      |    Version:  trunk
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+-----------------------------
 Some multibyte comment author names can be lost during comment submission.

 Example: consider a comment authored by a user named `テテテテテテテテテテテテテテテテテテテテ
 テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ`. This
 name is a 258-byte string (86 characters of three bytes each), longer than
 the 255-byte maximum of the `tinytext` `comment_author` column.
 `$wpdb->strip_invalid_text_for_column()` will truncate it to 255 bytes, and
 because each character is three bytes, the string is still valid UTF-8,
 albeit one character shorter.

 After `$wpdb->strip_invalid_text_for_column()` runs,
 `sanitize_text_field()` will run, which calls `wp_check_invalid_utf8()`,
 which will do nothing, because the string is still valid UTF-8.
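
 The arithmetic for this first name can be checked with a short Python
 sketch. It is only an illustration of the byte counts, not WordPress code;
 the constant name `COLUMN_MAX_BYTES` is made up for the example, and the
 byte slice stands in for the truncation described above.

```python
COLUMN_MAX_BYTES = 255  # assumed limit of a tinytext column, per the ticket

name = "\u30c6" * 86                 # 86 copies of テ, three bytes each in UTF-8
raw = name.encode("utf-8")
truncated = raw[:COLUMN_MAX_BYTES]   # stand-in for the truncation step

# 255 is a multiple of 3, so the cut lands on a character boundary
# and the result still decodes cleanly.
decoded = truncated.decode("utf-8")
print(len(raw), len(truncated), len(decoded))  # 258 255 85
```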

 If this commenter's older sister, `Aテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
 テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ`, also tries to comment,
 the result is very different. This name is a 259-byte string.
 `$wpdb->strip_invalid_text_for_column()` will truncate it to 255 bytes,
 removing one whole character and the last byte of another. When
 `wp_check_invalid_utf8()` gets called, it will truncate the string to zero
 bytes out of an abundance of caution, since the string now ends with an
 incomplete two-byte fragment that is not valid UTF-8.
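
 The failure for the second name can be reproduced the same way. In this
 Python sketch, `check_invalid_utf8()` is a hypothetical stand-in that only
 mimics the all-or-nothing behavior described above (return the input if it
 is valid, otherwise nothing); it is not the WordPress implementation.

```python
def check_invalid_utf8(data: bytes) -> bytes:
    """Hypothetical stand-in: return data if it is valid UTF-8, else b''."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return b""
    return data


raw = ("A" + "\u30c6" * 86).encode("utf-8")  # 1 + 86 * 3 = 259 bytes
truncated = raw[:255]                        # cuts through the 85th テ
tail = truncated[-2:]                        # two bytes left of a three-byte character
result = check_invalid_utf8(truncated)

print(len(raw), tail, len(result))  # 259 b'\xe3\x83' 0
```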

 It's clear that the commenter was not submitting invalid UTF-8, and the
 true goal of `$wpdb->strip_invalid_text_for_column()` was to ensure that
 the text would fit in the DB column without auto-truncation by the DB
 engine, so the ideal behavior is to truncate the string to the longest
 prefix that remains valid and fits within the column.
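
 As a rough illustration of that ideal behavior, here is a minimal Python
 sketch. The function name `truncate_utf8()` is hypothetical, and the sketch
 assumes the input was valid UTF-8 before truncation; it backs the cut point
 up past any continuation bytes (those of the form `10xxxxxx`) so the result
 ends on a character boundary.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Longest prefix of valid UTF-8 `data` that fits in `max_bytes` bytes."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte, the cut splits a character, so move it back.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


sister = ("A" + "\u30c6" * 86).encode("utf-8")   # the 259-byte name
fitted = truncate_utf8(sister, 255)
print(len(fitted), len(fitted.decode("utf-8")))  # 253 85
```

 With this approach the 259-byte name keeps 85 of its 87 characters instead
 of being emptied.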

 One way to get around this data loss would be a wrapper around
 `wp_check_invalid_utf8()`. If `wp_check_invalid_utf8()` fails, chop a
 single byte off the end of the string and check it again, up to the point
 where you have checked the string without the last three bytes (UTF-8 as
 defined in RFC 3629 uses at most four bytes per character; the original
 definition allowed up to six, but those longer forms are no longer valid).
 Or, fix `$wpdb->strip_invalid_text_for_column()` so that it doesn't
 truncate in the middle of a multibyte character.
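
 The wrapper idea could look roughly like the following Python sketch. Both
 function names are hypothetical, and `is_valid_utf8()` merely stands in for
 the validity check. The default of three chopped bytes assumes the
 four-byte limit of RFC 3629; a larger `max_chop` such as five is more
 conservative and only costs a few extra checks.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Stand-in for the validity check: True if data decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def salvage_utf8(data: bytes, max_chop: int = 3) -> bytes:
    """Retry the check with 0..max_chop trailing bytes removed."""
    for chopped in range(max_chop + 1):
        candidate = data[: len(data) - chopped]
        if is_valid_utf8(candidate):
            return candidate
    return b""  # still invalid: the problem is not just a cut-off tail


truncated = ("A" + "\u30c6" * 86).encode("utf-8")[:255]
salvaged = salvage_utf8(truncated)
print(len(salvaged))  # 253
```

 Text that is invalid somewhere other than its tail is still rejected, so
 the wrapper only rescues strings damaged by byte-level truncation.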

 Configuration details: Tested in both WordPress 4.4.2 and trunk
 (4.5-RC1-37153); PHP 5.2.17

 I noticed this issue with commenter names, so here's the structure of my
 comments DB table (created in 2006, FWIW):

 {{{
 CREATE TABLE `wp_comments` (
  `comment_ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `comment_post_ID` bigint(20) unsigned NOT NULL DEFAULT '0',
  `comment_author` tinytext NOT NULL,
  `comment_author_email` varchar(100) NOT NULL DEFAULT '',
  `comment_author_url` varchar(200) NOT NULL DEFAULT '',
  `comment_author_IP` varchar(100) NOT NULL DEFAULT '',
  `comment_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `comment_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `comment_content` text NOT NULL,
  `comment_karma` int(11) NOT NULL DEFAULT '0',
  `comment_approved` varchar(20) NOT NULL DEFAULT '1',
  `comment_agent` varchar(255) NOT NULL DEFAULT '',
  `comment_type` varchar(20) NOT NULL DEFAULT '',
  `comment_parent` bigint(20) unsigned NOT NULL DEFAULT '0',
  `user_id` bigint(20) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`comment_ID`),
  KEY `comment_post_ID` (`comment_post_ID`),
  KEY `comment_approved_date_gmt` (`comment_approved`,`comment_date_gmt`),
  KEY `comment_date_gmt` (`comment_date_gmt`),
  KEY `comment_parent` (`comment_parent`),
  KEY `comment_author_email` (`comment_author_email`(10))
 ) ENGINE=MyISAM AUTO_INCREMENT=2130254 DEFAULT CHARSET=latin1;
 }}}

 In case the strings I used as example commenter names above get mangled,
 here are their base64 encodings:

 commenter1: string(344)
 "776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D"

 commenter2: string(348)
 "Qe++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++gw=="
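
 The encodings can be decoded to confirm the byte counts. This Python sketch
 rebuilds the two strings from their repeating patterns, assuming they
 repeat exactly as their stated lengths (344 and 348) imply. As rebuilt
 here, they decode to the halfwidth character U+FF83 rather than the
 fullwidth form that may be displayed in the text above; both are three
 bytes in UTF-8, so the arithmetic is the same.

```python
import base64

# Rebuilt from the repeating patterns above (an assumption, see lead-in).
commenter1_b64 = "776D" * 86
commenter2_b64 = "Qe++" + "g+++" * 85 + "gw=="

name1 = base64.b64decode(commenter1_b64)
name2 = base64.b64decode(commenter2_b64)

print(len(commenter1_b64), len(name1))  # 344 258
print(len(commenter2_b64), len(name2))  # 348 259
```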

 I'm attaching a POC plugin that manually walks through how the commenter
 name gets handled in the comment submission process (but only when the
 first attempt to save the comment fails and then requires the
 `$wpdb->strip_invalid_text_for_column()` call).

--
Ticket URL: <https://core.trac.wordpress.org/ticket/36393>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

