[wp-hackers] About the Ticket #13590

Andrea Ercolino cappuccino.e.cornetto at gmail.com
Wed Jan 19 08:57:28 UTC 2011


I've recently developed a class for escaping / unescaping UTF-8
characters. I've released it as a Zend Framework class
(Zend_Utf8<http://framework.zend.com/wiki/display/ZFPROP/Zend_Utf8+-+Andrea+Ercolino>,
now in the proposal stage), and as a stand-alone class for a WordPress plugin
(Full UTF-8 <http://wordpress.org/extend/plugins/full-utf-8/>).

The plugin is meant to fix the issue reported in ticket #13590 (Inserting a
tetragram (SMP/Plane 1) character truncates post
fields<http://core.trac.wordpress.org/ticket/13590>),
which I stumbled upon a few days ago while trying to write an article about
RFC 4627 (JSON <http://www.ietf.org/rfc/rfc4627.txt>).
The plugin works pretty well for post content (and title, excerpt and
search) but it doesn't cover custom fields. For them I had to write a patch
that changed 8 different files. Anyway, a plugin plus a patch is not a clean
solution, and it's possible that some data strings still get through to the db
by an alternative path I couldn't find with my reverse engineering.

So I thought: why not wrap db queries between escape / unescape calls?
This way

   - nothing will ever hit the db without being taken care of
   - the patch will be extremely localized

I wrote the patch, ran the tests, saw that the issue got solved, and all
seemed fine to me.
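
The idea of a single choke point around the db can be sketched as follows.
This is only an illustration under stated assumptions, not the attached patch:
sketch_wrapped_query() and the three callables passed to it are hypothetical
names, and the "database" is a stand-in closure that echoes the query back.

```php
<?php
// Hypothetical sketch of one choke point around the database.
// $escape, $run and $unescape are caller-supplied callables (assumptions).
function sketch_wrapped_query($sql, $escape, $run, $unescape)
{
    // Escape once, just before the query leaves PHP.
    $rows = call_user_func($run, call_user_func($escape, $sql));

    // Unescape every column of every row on the way back.
    $result = array();
    foreach ($rows as $row) {
        $result[] = (object) array_map($unescape, $row);
    }
    return $result;
}

// Stand-ins for demonstration only.
$seen_by_db    = null;
$fake_escape   = function ($s) { return str_replace('*', '[star]', $s); };
$fake_unescape = function ($s) { return str_replace('[star]', '*', $s); };
$fake_run      = function ($sql) use (&$seen_by_db) {
    $seen_by_db = $sql;                      // what the "db" received
    return array(array('content' => $sql));  // one row, echoed back
};

$rows = sketch_wrapped_query('a*b', $fake_escape, $fake_run, $fake_unescape);
```

Here the stand-in db only ever sees the escaped form, while the caller gets the
original text back, which is the property the two bullet points rely on.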

There are some questions that need to be answered:

   1. Does this solution slow WP down too much?
   2. Can this solution fail in some cases?

I have no clear answers, only hints.

   1. Escaping and unescaping are cheap operations, but they do examine a
   string char by char. For this use case, I already short-circuit any char
   MySQL can handle by itself (UTF-8 sequences of up to 3 bytes).
   2. Only strings are escaped/unescaped; everything else is short-circuited
   (at least when writing to the db; when reading, everything is a string). So
   I think that only a binary string could cause trouble.
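
To make the two hints concrete, here is a minimal sketch of such an escape /
unescape pair with both short-circuits. It is not the plugin's code: the
function names are hypothetical, the "&#x1D306;" marker format is an
assumption chosen only for illustration, and closures require PHP 5.3+.

```php
<?php
// Hypothetical sketch: NOT the Full UTF-8 plugin's code or escape format.
// Assumption: 4-byte characters are stored as ASCII markers like "&#x1D306;".

function sketch_escape_4byte($value)
{
    // Short-circuit 1: only strings are examined.
    if (!is_string($value)) {
        return $value;
    }
    // Short-circuit 2: a 4-byte UTF-8 sequence always starts with a byte
    // in F0..F4, so strings without one contain nothing to escape.
    if (strpbrk($value, "\xF0\xF1\xF2\xF3\xF4") === false) {
        return $value;
    }
    $escaped = preg_replace_callback(
        '/[\x{10000}-\x{10FFFF}]/u',
        function ($m) {
            $b = unpack('C4', $m[0]); // the four bytes, indexed from 1
            $cp = (($b[1] & 0x07) << 18) | (($b[2] & 0x3F) << 12)
                | (($b[3] & 0x3F) << 6) | ($b[4] & 0x3F);
            return sprintf('&#x%X;', $cp);
        },
        $value
    );
    // preg_replace_callback() returns null on invalid UTF-8 (e.g. binary
    // data); in that case the value is passed through unchanged.
    return $escaped === null ? $value : $escaped;
}

function sketch_unescape_4byte($value)
{
    // Short-circuit: nothing to do without a marker prefix.
    if (!is_string($value) || strpos($value, '&#x') === false) {
        return $value;
    }
    return preg_replace_callback(
        '/&#x([0-9A-F]{5,6});/',
        function ($m) {
            $cp = hexdec($m[1]);
            if ($cp < 0x10000 || $cp > 0x10FFFF) {
                return $m[0]; // not a 4-byte code point: leave as is
            }
            return chr(0xF0 | ($cp >> 18))
                . chr(0x80 | (($cp >> 12) & 0x3F))
                . chr(0x80 | (($cp >> 6) & 0x3F))
                . chr(0x80 | ($cp & 0x3F));
        },
        $value
    );
}

$tetragram = "\xF0\x9D\x8C\x86";                  // U+1D306
$stored    = sketch_escape_4byte("a{$tetragram}b");
$back      = sketch_unescape_4byte($stored);
```

One limitation of this particular marker choice: text that already contains a
literal marker such as "&#x1D306;" would come back as the real character. Any
real escape format needs its own answer to that ambiguity.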

I'd like to know your thoughts, and whether the patch could find its way into
an upcoming WP release.
Here are the parts of the patch that change WP; the whole patch (with two
added files) is attached.

diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-includes/wp-db.php wp-db-patched/wp-includes/wp-db.php
--- wordpress-3.0.4/wp-includes/wp-db.php 2010-07-25 08:34:50.000000000 +0200
+++ wp-db-patched/wp-includes/wp-db.php 2011-01-18 12:51:56.000000000 +0100
@@ -1108,7 +1108,8 @@ class wpdb {
  $dbh =& $this->dbh;
  $this->last_db_used = "other/read";
  }
-
+
+ full_utf8_escape($query);
  $this->result = @mysql_query( $query, $dbh );
  $this->num_queries++;

@@ -1136,8 +1137,9 @@ class wpdb {
  $i++;
  }
  $num_rows = 0;
- while ( $row = @mysql_fetch_object( $this->result ) ) {
- $this->last_result[$num_rows] = $row;
+ while ( $row = @mysql_fetch_assoc( $this->result ) ) {
+    array_walk($row, 'full_utf8_unescape');
+ $this->last_result[$num_rows] = (object) $row;
  $num_rows++;
  }

diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-settings.php wp-db-patched/wp-settings.php
--- wordpress-3.0.4/wp-settings.php 2010-05-02 23:18:36.000000000 +0200
+++ wp-db-patched/wp-settings.php 2011-01-16 20:19:32.000000000 +0100
@@ -66,6 +66,7 @@ wp_set_lang_dir();
 require( ABSPATH . WPINC . '/compat.php' );
 require( ABSPATH . WPINC . '/functions.php' );
 require( ABSPATH . WPINC . '/classes.php' );
+require( ABSPATH . WPINC . '/full-utf8.php' );

 // Include the wpdb class, or a db.php database drop-in if present.
 require_wp_db();

