[HyperDB] Sharding scheme
Brian Layman
hyperdb at thecodecave.com
Tue Jan 25 18:03:41 UTC 2011
I have an install that will likely grow to 10,000 sites by year's end.
There's potential in this market space for that number to be
significantly higher. In the last week, it has doubled in size to 2200
sites. The database now has over 20,000 tables. Obviously, there is a
need for sharding here. There were extenuating circumstances that
prevented sharding at the initial launch, but I would like to activate
it now.
I was wondering what you thought would be a good breakdown of the
sites. I was thinking 100 sites per dataset would be a fair number. At
the current ratio, that puts each database at roughly 1,000 tables per
dataset. It would result in my having 100 databases in the end, which
has a nice even ring to it and is easily comprehended at a glance. My
worry is that I am creating too many database connections and will
create a bottleneck of another kind. Is this a concern? I don't want to
create a problem where one doesn't currently exist. The hardware is
keeping up just fine at the moment, and I am getting load times under
two or three seconds.
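(For the arithmetic, going by the current numbers: 20,000 tables across
2,200 sites is roughly 9 tables per site, so 100 sites per dataset works
out to about 900-1,000 tables per database, and 10,000 sites at 100
sites per dataset gives the 100 datasets.)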
So I was thinking of functions like this in db-config.php:
// Allow for 3,000 sites for now (blog IDs up to 3,099 map to datasets 0-30).
for ( $counter = 0; $counter <= 30; $counter++ ) {
    // Add the connection to the master.
    $wpdb->add_database( array(
        'host'     => DB_HOST,
        'user'     => DB_USER,
        'password' => DB_PASSWORD,
        'name'     => DB_NAME . '_' . $counter,
        'write'    => 1,
        'read'     => 1,
        'dataset'  => DB_NAME . '_' . $counter,
        'timeout'  => 0.2,
    ) );

    // Add the connection to the slave(s).
    $wpdb->add_database( array(
        'host'     => DB_HOST2,
        'user'     => DB_USER,
        'password' => DB_PASSWORD,
        'name'     => DB_NAME . '_' . $counter,
        'write'    => 0,
        'read'     => 1,
        'dataset'  => DB_NAME . '_' . $counter,
        'timeout'  => 0.2,
    ) );
}
$wpdb->add_callback( 'ehi_dbname' );

function ehi_dbname( $query, $wpdb ) {
    // Per-blog tables look like "{$base_prefix}{$blog_id}_*"; route each
    // one to the dataset for its hundred-block of blog IDs.
    if ( preg_match( "/^{$wpdb->base_prefix}\d+_/i", $wpdb->table ) ) {
        $dbid        = intval( $wpdb->blogid / 100 );
        $NEW_DB_NAME = DB_NAME . '_' . $dbid;
        error_log( $wpdb->table . ' would use: ' . $NEW_DB_NAME );
    } else {
        // Global (non-blog) tables go to a shared dataset. Note that this
        // "_common" dataset must also be registered via add_database().
        $NEW_DB_NAME = DB_NAME . '_common';
        error_log( $wpdb->table . ' would use: ' . $NEW_DB_NAME );
    }
    return $NEW_DB_NAME;
}
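For completeness, the "_common" dataset that the callback falls back to
for the global tables would presumably also need its own add_database()
entry -- something like this, assuming those tables stay on the same
master:

$wpdb->add_database( array(
    'host'     => DB_HOST,
    'user'     => DB_USER,
    'password' => DB_PASSWORD,
    'name'     => DB_NAME . '_common',
    'write'    => 1,
    'read'     => 1,
    'dataset'  => DB_NAME . '_common',
    'timeout'  => 0.2,
) );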
Other than the fact that I wouldn't create all 100 database connections
immediately, does that make sense? I'll probably make a foreach loop to
make adding read servers an easy task (something like the sketch below),
but I think I've got the basic concepts all there. Does this seem to be
a reasonable configuration with one read and one read/write server?
These are currently running on hand-built Amazon m1.large EC2 servers,
though I am considering switching them over to RDS during this process.
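Here is roughly the foreach I had in mind. It is only a sketch:
$read_slaves and DB_HOST3 are placeholders for however the slave
hostnames end up being defined.

// Sketch: register every read slave for every dataset.
$read_slaves = array( DB_HOST2, DB_HOST3 ); // placeholder hostnames
for ( $counter = 0; $counter <= 30; $counter++ ) {
    foreach ( $read_slaves as $slave_host ) {
        $wpdb->add_database( array(
            'host'     => $slave_host,
            'user'     => DB_USER,
            'password' => DB_PASSWORD,
            'name'     => DB_NAME . '_' . $counter,
            'write'    => 0,
            'read'     => 1,
            'dataset'  => DB_NAME . '_' . $counter,
            'timeout'  => 0.2,
        ) );
    }
}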
One more thought... Moving forward, as the need arises to create more
read servers, does it make sense to have all of the datasets active on
all of the read servers in order to keep reads evenly distributed? Or
would it be better to assign odd datasets to odd-numbered read servers
and even datasets to even-numbered read servers, thus dividing the
connections needed and evenly reducing the number of databases being
accessed on each machine? Would this perhaps improve caching?
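To make that second option concrete, a rough sketch of the odd/even
assignment (again with placeholder hostnames):

// Sketch: each dataset is registered on only the read slave whose
// index matches the dataset number's parity.
$read_slaves = array( DB_HOST2, DB_HOST3 ); // placeholders; 0 = even, 1 = odd
for ( $counter = 0; $counter <= 30; $counter++ ) {
    $wpdb->add_database( array(
        'host'     => $read_slaves[ $counter % 2 ],
        'user'     => DB_USER,
        'password' => DB_PASSWORD,
        'name'     => DB_NAME . '_' . $counter,
        'write'    => 0,
        'read'     => 1,
        'dataset'  => DB_NAME . '_' . $counter,
        'timeout'  => 0.2,
    ) );
}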
I'd love to gain from the expertise of people who have been down this
road before.
Thanks!
-Brian Layman
The eHermit
eHermitsInc.com