[HyperDB] Sharding scheme

Brian Layman bulk at thecodecave.com
Mon Jan 31 01:47:34 UTC 2011


On 1/25/2011 1:03 PM, Brian Layman wrote:
> I have an install that will likely grow to 10,000 sites by year's 
> end.  There's potential in this market space for that number to be 
> significantly higher.  In the last week, it has doubled in size to 
> 2200 sites. The database now has over 20,000 tables.  Obviously, there 
> is a need for sharding here. There were extenuating circumstances that 
> prevented sharding on the initial launch but I would like to activate 
> it now.
>
> I was wondering what you thought would be a good breakdown of the 
> sites...  I was thinking 100 sites per dataset would be a fair number. 
> That puts each database at 1500-ish tables per dataset.  It would 
> result in my having 100 databases in the end; it has a nice even ring 
> to it and is easily comprehended at a glance. My worry is that I am 
> creating too many database connections and will create a bottleneck of 
> another kind.  Is this a concern? I don't want to create a problem 
> where one doesn't currently exist. The hardware is keeping up just 
> fine at the moment and I am getting sub-two-to-three-second load times.
>
> So I was thinking of functions like this in db-config.php:
>
> for ( $counter = 0; $counter < 30; $counter++ ) { // Allow for 3000 
> sites for now (30 datasets x 100 sites)
>     // Add the connection to the Master
>     $wpdb->add_database(array(
>         'host' => DB_HOST,
>         'user' => DB_USER,
>         'password' => DB_PASSWORD,
>         'name' => DB_NAME . '_' . $counter,
>         'write' => 1,
>         'read' => 1,
>         'dataset' => DB_NAME . '_' . $counter,
>         'timeout' => 0.2,
>     ));
>
>     // Add the connection to the Slave(s)
>     $wpdb->add_database(array(
>         'host' => DB_HOST2,
>         'user' => DB_USER,
>         'password' => DB_PASSWORD,
>         'name' => DB_NAME . '_' . $counter,
>         'write' => 0,
>         'read' => 1,
>         'dataset' => DB_NAME . '_' . $counter,
>         'timeout' => 0.2,
>     ));
> }
>
> $wpdb->add_callback('ehi_dbname');
>
> function ehi_dbname($query, $wpdb) {
>     // Blog tables (e.g. wp_123_posts) go to that blog's dataset;
>     // everything else goes to the common dataset.
>     if ( preg_match("/^{$wpdb->base_prefix}\d+_/i", $wpdb->table) ) {
>         $dbid = intval($wpdb->blogid / 100);
>         $NEW_DB_NAME = DB_NAME . '_' . $dbid;
>     } else {
>         $NEW_DB_NAME = DB_NAME . '_common';
>     }
>
>     error_log($wpdb->table . " would use: " . $NEW_DB_NAME);
>
>     return $NEW_DB_NAME;
> }
>
>
> Other than the fact that I wouldn't create all 100 database 
> connections immediately, does that make sense? I'll probably make a 
> foreach loop to make adding read servers an easy task, but I think 
> I've got the basic concepts all there.  Does this seem to be a 
> reasonable configuration with one read and one read/write server? 
> These are currently running on hand-built Amazon m1.large EC2 
> servers, though I am considering switching them over to RDS during 
> this process.
>
> One more thought...  Moving forward, as the need arises to create more 
> read servers, does it make sense to have all of the datasets active on 
> all of the read servers in order to keep reads evenly distributed? Or 
> would it be better to assign odd datasets to odd-numbered read servers 
> and even datasets to even-numbered ones, which would divide the 
> connections needed and evenly reduce the number of databases being 
> accessed on each machine? Would this perhaps improve caching?
>
> I'd love to gain from the expertise of people who have been down this 
> road before.
>
> Thanks!
>
Nobody has replied to this, but in the meantime I went and read the 
entirety of this mailing list's archive.  In August of last year, 
Andy had said:

"Without further information I'd say proceed without partitioning.
Don't try to do all your scaling in advance; it doesn't work. [...] 
experiments once you have real-world data and traffic."

when talking about a potentially 50K-site database.

Later Callum mentioned:

"we partitioned the data (for about 300k blogs at that point) across multiple databases
on a single database server."

Soooooo, since I am having no noticeable slowdowns (with the exception 
of phpMyAdmin, which calls "show tables") with our database of 22K 
tables - a mere 2426 sites - I'm thinking I should not even consider 
sharding.  Is that correct?
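For what it's worth, the blog-to-dataset arithmetic from my earlier callback can be isolated into a tiny standalone function. This is just a sketch of the mapping I described (the `ehi_dataset_for_blog` name, the `wpdb` base name, and the 100-sites-per-dataset figure are my own placeholders, not anything HyperDB defines):

```php
<?php
// Sketch of the dataset-assignment arithmetic from the callback above.
// Blogs 0-99 map to dataset 0, blogs 100-199 to dataset 1, and so on;
// blog id 0 (global, non-blog tables) maps to the shared common dataset.
function ehi_dataset_for_blog( $blog_id, $sites_per_dataset = 100, $base = 'wpdb' ) {
    if ( $blog_id === 0 ) {
        return $base . '_common';
    }
    return $base . '_' . intval( $blog_id / $sites_per_dataset );
}

echo ehi_dataset_for_blog( 250 ), "\n"; // blogs 200-299 land in dataset 2: wpdb_2
echo ehi_dataset_for_blog( 99 ), "\n";  // blogs 1-99 land in dataset 0: wpdb_0
```

Keeping the math in one place like this means the `add_database` loop and the `add_callback` function can't drift out of sync about how many sites share a dataset.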
