[wp-trac] [WordPress Trac] #65375: Paginated XML sitemap sub-pages return 404 once paged exceeds the blog post page count
WordPress Trac
noreply at wordpress.org
Sat May 30 16:50:28 UTC 2026
#65375: Paginated XML sitemap sub-pages return 404 once paged exceeds the blog post
page count
--------------------------+------------------------------------------
Reporter: extrachill | Owner: (none)
Type: defect (bug) | Status: new
Priority: normal | Milestone: Awaiting Review
Component: Sitemaps | Version: 5.5
Severity: normal | Keywords: needs-patch needs-unit-tests
Focuses: |
--------------------------+------------------------------------------
A paginated core XML sitemap sub-page (e.g.
`wp-sitemap-posts-<post_type>-N.xml` or `wp-sitemap-taxonomies-<tax>-N.xml`)
is served with an HTTP '''404''' status once the page number `N` exceeds the
number of pages in the '''blog (`post`) sitemap''', even though
`WP_Sitemaps` renders a full, valid `<urlset>` body for that page.
The 404 boundary tracks the site's regular `post` count and
`posts_per_page`, '''not''' the post type or taxonomy actually being
requested. On a site with many CPT/taxonomy URLs but few regular posts,
the sitemap index advertises (say) 40 sub-sitemaps but only the first few
return `200`; the rest return `404` with valid XML in the body. Search
engines discard the 404'd pages, so most of the advertised URLs never get
indexed.
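The boundary arithmetic can be sketched in a few lines of plain PHP. This is only an illustration of the mismatch described above, not core code: the helper name `repro_pages_served_404()` and its parameters are hypothetical, and it assumes the 404 boundary is exactly `ceil( post count / posts_per_page )` with page 1 always exempt (it is not a paged request), as observed in the reproduction.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): list the sub-sitemap pages
 * that the index would advertise but that would be answered with a 404.
 */
function repro_pages_served_404( int $blog_posts, int $posts_per_page, int $subtype_urls, int $max_urls ): array {
    // Page count of the dummy main query (regular `post` entries).
    $blog_pages = (int) ceil( $blog_posts / $posts_per_page );
    // Page count the sitemap provider advertises for the requested subtype.
    $sitemap_pages = (int) ceil( $subtype_urls / $max_urls );

    // Assumption: page 1 is never affected because it is not a paged request.
    $boundary = max( 1, $blog_pages );

    $affected = array();
    for ( $page = 1; $page <= $sitemap_pages; $page++ ) {
        if ( $page > $boundary ) {
            $affected[] = $page;
        }
    }
    return $affected;
}

// 6 blog posts at 5 per page; 120 CPT URLs at 50 per sitemap page.
print_r( repro_pages_served_404( 6, 5, 120, 50 ) );
```

With those inputs the blog has 2 pages and the CPT sitemap has 3, so only page 3 is listed. A site with enough regular posts (for example 100 posts at 10 per page) would list none, which is why the bug is easy to miss on blog-heavy sites.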
== Environment ==
* WordPress 5.5 through 7.0 (reproduced unchanged across versions in
clean WordPress Playground instances — no third-party plugins). See
"Affected versions" below.
* Pretty permalinks enabled; core XML sitemaps enabled (default).
* A public custom post type, OR a custom taxonomy registered on both
`post` and a CPT.
== Steps to reproduce ==
Run the self-contained script below in any clean WordPress install with
pretty permalinks and core sitemaps enabled (no third-party plugins), via
`wp eval-file sitemap-paged-404-repro.php`. It registers a public CPT and a
taxonomy shared by `post` and the CPT, seeds a boundary condition
where the blog sitemap has fewer pages than the CPT sitemap, drives the
real core request lifecycle (`WP::query_posts()` -> `WP::handle_404()`)
for each sitemap sub-page URL, and prints the resulting HTTP status.
Verified on WP 7.0: the script prints `bug_reproduced: true`, with the
beyond-boundary CPT page at `status_header: 404` while the renderer holds
valid URLs.
{{{#!php
<?php
/**
 * Standalone reproduction for: paginated XML sitemap sub-pages 404 once `paged`
 * exceeds the blog (`post`) sitemap page count.
 *
 * Runs in any clean WordPress with pretty permalinks + core sitemaps enabled.
 * No third-party plugins required.
 *
 *     wp eval-file sitemap-paged-404-repro.php
 *
 * It registers a public CPT and a taxonomy shared on both `post` and the CPT,
 * seeds a boundary condition where the blog sitemap has FEWER pages than the
 * CPT sitemap, drives the real core request lifecycle (WP::query_posts() ->
 * WP::handle_404()) for each sitemap sub-page URL, and prints the resulting
 * HTTP status + is_404. A reproducing run prints `bug_reproduced: true` with
 * the beyond-boundary CPT page at status 404 while the renderer holds valid URLs.
 */
if ( ! defined( 'ABSPATH' ) ) {
    exit( 1 );
}

// 1. Public CPT + a taxonomy shared on BOTH `post` and the CPT. No 404 hooks.
register_post_type( 'repro_event', array(
    'public'      => true,
    'has_archive' => true,
    'rewrite'     => array( 'slug' => 'repro-event' ),
    'taxonomies'  => array( 'repro_artist' ),
) );
register_taxonomy( 'repro_artist', array( 'post', 'repro_event' ), array(
    'public'       => true,
    'hierarchical' => false,
    'rewrite'      => array( 'slug' => 'repro-artist' ),
) );

// Small sitemap page size so the CPT sitemap advertises multiple pages cheaply.
// Does NOT change the mechanism: the 404 boundary is the MAIN query's
// posts_per_page (blog page count), not this value.
add_filter( 'wp_sitemaps_max_urls', static function () { return 50; } );

// 2. Pretty permalinks + tiny blog page size, then flush rewrites.
update_option( 'permalink_structure', '/%postname%/' );
update_option( 'posts_per_page', 5 );
global $wp_rewrite;
$wp_rewrite->init();
$wp_rewrite->set_permalink_structure( '/%postname%/' );
$wp_rewrite->flush_rules( false );

// 3. Seed: blog 6 posts / 5 per page = 2 main-query pages -> 404 boundary at page 3.
//    CPT 120 posts / 50 max_urls = 3 CPT sitemap pages. Page 3 > boundary 2.
$term    = wp_insert_term( 'Test Artist', 'repro_artist', array( 'slug' => 'test-artist' ) );
$term_id = is_wp_error( $term )
    ? (int) get_term_by( 'slug', 'test-artist', 'repro_artist' )->term_id
    : (int) $term['term_id'];

for ( $i = 1; $i <= 6; $i++ ) {
    wp_insert_post( array(
        'post_type'   => 'post',
        'post_status' => 'publish',
        'post_title'  => "Blog Post $i",
        'post_name'   => "blog-post-$i",
    ) );
}
for ( $i = 1; $i <= 120; $i++ ) {
    $pid = wp_insert_post( array(
        'post_type'   => 'repro_event',
        'post_status' => 'publish',
        'post_title'  => "Repro Event $i",
        'post_name'   => "repro-event-$i",
    ) );
    if ( $pid && ! is_wp_error( $pid ) ) {
        wp_set_object_terms( $pid, array( $term_id ), 'repro_artist' );
    }
}

$ppp                   = (int) get_option( 'posts_per_page' );
$blog_main_query_pages = (int) ceil( (int) wp_count_posts( 'post' )->publish / $ppp );

// 4. Drive the real core request lifecycle for each sitemap sub-page URL.
$probe = static function ( $query_vars_string ) {
    global $wp, $wp_query, $wp_the_query;

    $captured = array( 'status' => null );
    $cb       = static function ( $header ) use ( &$captured ) {
        if ( preg_match( '#\s(\d{3})\s#', ' ' . $header . ' ', $m ) ) {
            $captured['status'] = (int) $m[1];
        }
        return $header;
    };
    add_filter( 'status_header', $cb, 10, 1 );

    $wp                  = new WP();
    $wp_query            = new WP_Query();
    $wp_the_query        = $wp_query;
    $GLOBALS['wp_query'] = $wp_query;
    parse_str( $query_vars_string, $qv );
    $wp->query_vars = $qv;
    $wp->query_posts(); // class-wp.php:824
    $wp->handle_404();  // class-wp.php:825, sets the status
    $is_404 = $wp_query->is_404();

    // What WP_Sitemaps would actually render for this page.
    $server    = wp_sitemaps_get_server();
    $sitemap   = $qv['sitemap'] ?? '';
    $subtype   = $qv['sitemap-subtype'] ?? '';
    $paged     = isset( $qv['paged'] ) ? (int) $qv['paged'] : 1;
    $url_count = null;
    if ( $sitemap && 'index' !== $sitemap ) {
        $provider = $server->registry->get_provider( $sitemap );
        if ( $provider ) {
            $url_list  = $provider->get_url_list( $paged ?: 1, $subtype );
            $url_count = is_array( $url_list ) ? count( $url_list ) : 0;
        }
    }

    remove_filter( 'status_header', $cb, 10 );

    return array(
        'query'                      => $query_vars_string,
        'main_query_post_type'       => $wp_query->get( 'post_type' ),
        'main_query_posts'           => count( $wp_query->posts ),
        'handle_404_set_is_404'      => $is_404,
        'status_header'              => $captured['status'],
        'sitemap_renderer_url_count' => $url_count,
    );
};

$probes = array(
    'cpt_page_1'    => $probe( 'sitemap=posts&sitemap-subtype=repro_event&paged=1' ),
    'cpt_page_3'    => $probe( 'sitemap=posts&sitemap-subtype=repro_event&paged=3' ),
    'artist_page_3' => $probe( 'sitemap=taxonomies&sitemap-subtype=repro_artist&paged=3' ),
);

$bug_reproduced = ( true === $probes['cpt_page_3']['handle_404_set_is_404'] )
    && ( $probes['cpt_page_3']['sitemap_renderer_url_count'] > 0 );

echo wp_json_encode( array(
    'wp_version'            => get_bloginfo( 'version' ),
    'blog_main_query_pages' => $blog_main_query_pages,
    'probes'                => $probes,
    'bug_reproduced'        => $bug_reproduced,
), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES ) . "\n";
}}}
== Expected vs actual ==
'''Expected:''' `200 OK` with the page's `<urlset>` (the renderer produces
valid URLs for that page).
'''Actual:''' `404 Not Found`, but the response body is a complete, valid
`<urlset>`. The status header and the body disagree.
Captured output (clean WP 7.0, no plugins):
|| '''request''' || '''paged''' || '''main query post_type''' || '''main query posts''' || '''status''' || '''renderer URL count''' ||
|| wp-sitemap-posts-repro_event-1.xml || 1 || "" (defaults to post) || 5 || '''200''' || 50 ||
|| wp-sitemap-posts-repro_event-3.xml || 3 || "" (defaults to post) || 0 || '''404''' || '''20''' ||
|| wp-sitemap-taxonomies-repro_artist-3.xml || 3 || "" (defaults to post) || 0 || '''404''' || valid ||
Blog main-query pages = 2 (7 published posts / 5 per page; the script seeds
6, so the seventh is presumably the default post of a clean install); CPT
sitemap pages = 3. The 404 appears at page 3 because 3 > 2: the boundary is
the '''blog''' page count, irrespective of the requested subtype.
== Root cause ==
The core sitemap rewrite rule maps a paginated sitemap URL to query vars
'''without a `post_type`''':
{{{
^wp-sitemap-([a-z]+?)-([a-z\d_-]+?)-(\d+?)\.xml$
=> index.php?sitemap=$1&sitemap-subtype=$2&paged=$3
}}}
So `wp-sitemap-posts-repro_event-3.xml` becomes
`index.php?sitemap=posts&sitemap-subtype=repro_event&paged=3`. There is no
`post_type` query var, so the '''dummy main `WP_Query`''' built for the
request defaults to the `post` post type at the site's `posts_per_page`.
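The mapping can be checked outside WordPress by applying the same pattern with `preg_match()`. The sketch below is illustrative only: `repro_sitemap_query_vars()` is a hypothetical helper, not a core function, and it reproduces just the substitution step of the rewrite rule quoted above.

```php
<?php
/**
 * Hypothetical helper (not a WordPress function): apply the paginated
 * sitemap rewrite pattern to a request path and return the query vars it
 * would produce, or null when the path does not match.
 */
function repro_sitemap_query_vars( string $path ): ?array {
    $pattern = '#^wp-sitemap-([a-z]+?)-([a-z\d_-]+?)-(\d+?)\.xml$#';
    if ( 1 !== preg_match( $pattern, $path, $m ) ) {
        return null;
    }
    // Mirrors index.php?sitemap=$1&sitemap-subtype=$2&paged=$3.
    return array(
        'sitemap'         => $m[1],
        'sitemap-subtype' => $m[2],
        'paged'           => $m[3],
    );
}

$vars = repro_sitemap_query_vars( 'wp-sitemap-posts-repro_event-3.xml' );
var_export( $vars );
echo "\n";

// No post_type key is produced, so nothing tells the main query which
// post type the request is about.
var_export( array_key_exists( 'post_type', $vars ) );
echo "\n";
```

The subtype (`repro_event`) only ever appears under `sitemap-subtype`, a key the main query does not interpret, which is why the main query falls back to `post`.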
`WP::main()` (source:tags/7.0/src/wp-includes/class-wp.php#L818) runs, in
order:
* `query_posts()` — source:tags/7.0/src/wp-includes/class-wp.php#L824 —
runs the dummy main query (post type = post)
* `handle_404()` — source:tags/7.0/src/wp-includes/class-wp.php#L825 —
'''decides the status BEFORE the renderer runs'''
* `register_globals()` — L826
* `send_headers()` — L829
* `do_action( 'wp' )` — L838
In `WP::handle_404()` (source:tags/7.0/src/wp-includes/class-wp.php#L724),
for `paged=3` the main query returns zero `post`s:
* L754 `elseif ( $wp_query->posts )` — false (no posts).
* L783 `elseif ( ! is_paged() )` / L789
(`is_tag()`/`is_category()`/`is_tax()`/`is_post_type_archive()` +
`get_queried_object()`, `is_home()`, `is_search()`, `is_feed()`) — none of
these exemptions match a sitemap query.
* `$set_404` stays true -> L797–L801: `$wp_query->set_404();
status_header( 404 ); nocache_headers();`.
Then, later in the request, `WP_Sitemaps::render_sitemaps()`
(source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L163),
hooked on '''`template_redirect`''' (registered in `WP_Sitemaps::init()`
at source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L69),
runs '''after''' `WP::main()` has already returned and already set the
404. It queries its '''own''' provider (`$provider->get_url_list( $paged,
$object_subtype )`, L208), gets a valid non-empty URL list, and renders it
(L217 `render_sitemap()`, L218 `exit`). It only sets its own 404 when the
page is genuinely empty (L211–L214) — and it '''never resets''' the status
header back to `200` for a populated page.
Net: `handle_404()` (main query, defaults to `post`) and
`render_sitemaps()` (provider query, correct subtype) disagree about
whether the page exists, and `handle_404()` wins the status header because
it runs first and the renderer never corrects it.
== Why there is no existing guard ==
`WP_Sitemaps::redirect_sitemapxml()` '''was''' hooked on `pre_handle_404`,
but it never short-circuited 404 handling for sitemap requests: it only
issued a `wp_safe_redirect()` for the legacy `pagename=sitemap-xml`
permalink case. That method was '''deprecated in 6.7.0'''
(source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L231,
`_deprecated_function( __FUNCTION__, '6.7.0' )`), and nothing else in
`WP_Sitemaps` filters `pre_handle_404`. So the main-query 404 path has
been unguarded for paginated sitemap routes since 5.5, both before and
after that deprecation.
== Proposed fix ==
Have `WP_Sitemaps` short-circuit 404 handling for its own routes with a
`pre_handle_404` filter. The renderer already owns the legitimate
empty-page 404 (L211–L214), so it can remain responsible for the status of
sitemap responses:
{{{#!php
<?php
// In WP_Sitemaps::init(), alongside the template_redirect registration:
add_filter( 'pre_handle_404', array( $this, 'pre_handle_404' ), 10, 2 );

/**
 * Short-circuit core 404 handling for sitemap routes so the dummy main query
 * (which defaults to the `post` post type) cannot 404 a sitemap response.
 * WP_Sitemaps::render_sitemaps() remains responsible for the response and
 * still issues its own 404 when a page genuinely has no URLs.
 */
public function pre_handle_404( $bypass, $wp_query ) {
    if ( $bypass ) {
        return $bypass;
    }
    if ( $wp_query->get( 'sitemap' ) || $wp_query->get( 'sitemap-stylesheet' ) ) {
        return true;
    }
    return $bypass;
}
}}}
Alternatively, exempt sitemap query vars inside `WP::handle_404()` itself,
but the `pre_handle_404` approach keeps the sitemap-specific knowledge in
`WP_Sitemaps`.
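Since the ticket is tagged `needs-unit-tests`, the decision logic of the proposed filter can be exercised on its own before wiring it into core. The following is a minimal sketch that runs without WordPress: `Repro_Query_Stub` is a hypothetical stand-in for `WP_Query` that models only `get()`, and `repro_pre_handle_404()` copies the body of the proposed method into a plain function. It says nothing about how the filter interacts with the rest of the request lifecycle.

```php
<?php
/**
 * Hypothetical stand-in for WP_Query: only the get() accessor is modelled.
 */
final class Repro_Query_Stub {
    private $vars;

    public function __construct( array $vars ) {
        $this->vars = $vars;
    }

    public function get( string $key ) {
        // WP_Query::get() returns an empty string for unset vars by default.
        return $this->vars[ $key ] ?? '';
    }
}

/**
 * Same decision logic as the proposed WP_Sitemaps::pre_handle_404(),
 * written as a plain function so it can run without WordPress.
 */
function repro_pre_handle_404( $bypass, $query ) {
    if ( $bypass ) {
        return $bypass;
    }
    if ( $query->get( 'sitemap' ) || $query->get( 'sitemap-stylesheet' ) ) {
        return true;
    }
    return $bypass;
}

$sitemap_request = new Repro_Query_Stub( array(
    'sitemap'         => 'posts',
    'sitemap-subtype' => 'repro_event',
    'paged'           => '3',
) );
$blog_request    = new Repro_Query_Stub( array( 'paged' => '3' ) );

var_dump( repro_pre_handle_404( false, $sitemap_request ) ); // bool(true): core 404 handling is skipped.
var_dump( repro_pre_handle_404( false, $blog_request ) );    // bool(false): ordinary requests are unaffected.
```

Returning early when `$bypass` is already truthy keeps the filter from overriding a decision made by an earlier `pre_handle_404` callback.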
== Affected versions (verified) ==
Not a regression — present since core XML sitemaps shipped in 5.5. The
same repro script was run unchanged across multiple major versions in
clean Playground instances; the paginated sub-page is flagged `is_404`
(and served 404) in every one:
|| '''WordPress''' || '''cpt_page_3 handle_404 set is_404''' || '''renderer URL count''' ||
|| 5.5.18 || true || 20 ||
|| 6.6.5 || true || 20 ||
|| 6.7.5 || true || 20 ||
|| 7.0 || true || 20 ||
(The deprecation of `WP_Sitemaps::redirect_sitemapxml()` in 6.7.0 is
unrelated: that method only performed a `wp_safe_redirect()` for the
legacy `pagename=sitemap-xml` permalink case and never guarded paginated
sub-sitemap requests against the main-query 404. The bug reproduces
identically on 6.6, before that deprecation.)
== Related ==
Surfaced while investigating a related virtual-route status bug in feeds
(`get_feed_build_date()` / `WP_Query` `fields => ids`, see
[https://github.com/WordPress/wordpress-develop/pull/11387 wordpress-develop#11387],
where the same "is the virtual route masking a lower-level main-query
issue?" question applies). This sitemap ticket is the sitemap-route
analogue: a virtual route whose status is being decided by the unrelated
dummy main query.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/65375>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform