[wp-trac] [WordPress Trac] #57483: Improve optimizations for WP_Site::get_instance and WP_Network::get_instance
WordPress Trac
noreply at wordpress.org
Tue Jan 17 13:49:43 UTC 2023
#57483: Improve optimizations for WP_Site::get_instance and
WP_Network::get_instance
--------------------------+-----------------------------
Reporter: kovshenin | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: General | Version: 5.3
Severity: normal | Keywords:
Focuses: performance |
--------------------------+-----------------------------
Following up on #42251 and r45910.
I recently ran into a problem with this specific optimization. In one case
it resulted in an existing site being redirected to the main site as if it
did not exist. In a second case it went all the way to `dead_db()`. In
both instances the existing site in question had a `-1` stored in the
object cache. Both problems persisted until the object cache was flushed.
I think the main problem is that `WP_Site::get_instance` (and its
`WP_Network` counterpart) does not account for `wpdb::get_row` failing;
we never check `last_error`. This means that a brief database outage can
cause a permanent `-1` to be saved in the object cache for a site that
actually does exist.
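For reference, the cache-miss path in question looks roughly like this
(a simplified paraphrase, not the exact core source):
{{{
$_site = wp_cache_get( $site_id, 'sites' );

if ( false === $_site ) {
    $_site = $wpdb->get_row( $wpdb->prepare(
        "SELECT * FROM {$wpdb->blogs} WHERE blog_id = %d LIMIT 1",
        $site_id
    ) );

    if ( empty( $_site ) ) {
        // get_row() also returns null when the query itself fails, so a
        // transient DB error gets cached exactly like a missing site.
        $_site = -1;
    }

    wp_cache_add( $site_id, $_site, 'sites' );
}
}}}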
It's not easy to reproduce, as the existing/valid sites will be in cache
most of the time, so that portion of the codebase is rarely executed.
However, there is one core function that helps tremendously here:
`wpmu_update_blogs_date()`, which invalidates the site cache every time
it needs to update the timestamp (when a post is published or updated,
for example).
I wrote this little mu-plugin to help reproduce the problem:
{{{
add_action( 'muplugins_loaded', function() {
    // Bump the last_updated timestamp for site 3 on every request, which
    // invalidates its 'sites' cache entry and forces the next
    // WP_Site::get_instance() call back to the database.
    switch_to_blog( 3 );
    wpmu_update_blogs_date();
    restore_current_blog();
    die();
} );
}}}
I fired some traffic at the site and watched the MySQL query log, which
confirmed a significant number of `UPDATE` queries for the site
timestamp, as well as `SELECT * FROM wp_blogs ...` queries coming from
`WP_Site::get_instance()`.
Next, I simulated some network turbulence by killing all established
connections on MySQL port 3306 (using tcpkill), resulting in a few
"MySQL server has gone away" errors in the PHP error log. I did this a
couple of times until the site with ID 3 disappeared from My Sites in
the Network Admin.
I confirmed the object cache state in Redis:
{{{
127.0.0.1:6379> get sites:3
"-1"
}}}
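The same poisoned entry can also be inspected (and cleared) from the
WordPress side via the cache API; a hypothetical quick check, assuming
the same `sites` cache group:
{{{
// A poisoned entry for site 3 comes back as -1 rather than a row object.
var_dump( wp_cache_get( 3, 'sites' ) ); // -1 once the bug triggers.

// Until the code is fixed, the only remedy is removing the bad entry:
wp_cache_delete( 3, 'sites' );
}}}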
I was able to reproduce this quite reliably with about 50 concurrent
requests to the site and 2-3 rounds of tcpkill. Of course it doesn't
happen all that often under normal conditions, though I was unlucky
enough to see it happen twice in the last 6 months (on very
high-traffic sites with plenty of updates).
Personally, I feel we should revert this entire optimization. As much
as I appreciate all the hard work, I don't see much benefit in caching
IDs that don't exist, much like we don't cache posts that don't exist.
At the very least, though, we should check `last_error` before caching
a `-1` in both cases.
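Something along these lines, as a minimal sketch of that guard (assuming
we simply bail on a query error rather than caching anything):
{{{
$_site = $wpdb->get_row( $wpdb->prepare(
    "SELECT * FROM {$wpdb->blogs} WHERE blog_id = %d LIMIT 1",
    $site_id
) );

if ( empty( $_site ) ) {
    if ( $wpdb->last_error ) {
        // The query failed (e.g. "MySQL server has gone away"); bail
        // without caching, so a site that exists isn't marked missing.
        return false;
    }
    // Confirmed missing: caching the -1 marker is safe here.
    $_site = -1;
}

wp_cache_add( $site_id, $_site, 'sites' );
}}}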
--
Ticket URL: <https://core.trac.wordpress.org/ticket/57483>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform