[wp-trac] [WordPress Trac] #65375: Paginated XML sitemap sub-pages return 404 once paged exceeds the blog post page count

WordPress Trac noreply at wordpress.org
Sat May 30 16:50:28 UTC 2026


#65375: Paginated XML sitemap sub-pages return 404 once paged exceeds the blog post
page count
--------------------------+------------------------------------------
 Reporter:  extrachill    |      Owner:  (none)
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  Awaiting Review
Component:  Sitemaps      |    Version:  5.5
 Severity:  normal        |   Keywords:  needs-patch needs-unit-tests
  Focuses:                |
--------------------------+------------------------------------------
 A paginated core XML sitemap sub-page (e.g.
 `wp-sitemap-posts-<post_type>-N.xml` or `wp-sitemap-taxonomies-<tax>-N.xml`)
 is served with an HTTP '''404''' status once the page number `N` exceeds the
 number of pages in the '''blog's main `post` query''' (published posts /
 `posts_per_page`), even though `WP_Sitemaps` renders a full, valid
 `<urlset>` body for that page.

 The 404 boundary tracks the site's regular `post` count and
 `posts_per_page`, '''not''' the post type or taxonomy actually being
 requested. On a site with many CPT/taxonomy URLs but few regular posts,
 the sitemap index advertises (say) 40 sub-sitemaps but only the first few
 return `200`; the rest return `404` with valid XML in the body. Search
 engines discard the 404'd pages, so most of the advertised URLs never get
 indexed.

 == Environment ==

  * WordPress 5.5 through 7.0 (reproduced unchanged across versions in
 clean WordPress Playground instances — no third-party plugins). See
 "Affected versions" below.
  * Pretty permalinks enabled; core XML sitemaps enabled (default).
  * A public custom post type, OR a custom taxonomy registered on both
 `post` and a CPT.

 == Steps to reproduce ==

 Run the self-contained script below in any clean WordPress install with
 pretty permalinks and core sitemaps enabled (no third-party plugins):
 `wp eval-file sitemap-paged-404-repro.php`. It registers a public CPT and a
 taxonomy shared by `post` and the CPT, seeds a boundary condition where the
 blog's main query has fewer pages than the CPT sitemap, drives the real core
 request lifecycle (`WP::query_posts()` -> `WP::handle_404()`) for each
 sitemap sub-page URL, and prints the resulting HTTP status. Verified on
 WP 7.0: it prints `bug_reproduced: true`, with the beyond-boundary CPT page
 at `status_header: 404` while the renderer holds valid URLs.

 {{{#!php
 <?php
 /**
  * Standalone reproduction for: paginated XML sitemap sub-pages 404 once
  * `paged` exceeds the blog (`post`) main-query page count.
  *
  * Runs in any clean WordPress with pretty permalinks + core sitemaps
  * enabled. No third-party plugins required.
  *
  *   wp eval-file sitemap-paged-404-repro.php
  *
  * It registers a public CPT and a taxonomy shared on both `post` and the
  * CPT, seeds a boundary condition where the blog main query has FEWER pages
  * than the CPT sitemap, drives the real core request lifecycle
  * (WP::query_posts() -> WP::handle_404()) for each sitemap sub-page URL,
  * and prints the resulting HTTP status + is_404. A reproducing run prints
  * `bug_reproduced: true` with the beyond-boundary CPT page at status 404
  * while the renderer holds valid URLs.
  */

 if ( ! defined( 'ABSPATH' ) ) {
         exit( 1 );
 }

 // 1. Public CPT + a taxonomy shared on BOTH `post` and the CPT. No 404 hooks.
 register_post_type( 'repro_event', array(
         'public'      => true,
         'has_archive' => true,
         'rewrite'     => array( 'slug' => 'repro-event' ),
         'taxonomies'  => array( 'repro_artist' ),
 ) );
 register_taxonomy( 'repro_artist', array( 'post', 'repro_event' ), array(
         'public'       => true,
         'hierarchical' => false,
         'rewrite'      => array( 'slug' => 'repro-artist' ),
 ) );

 // Small sitemap page size so the CPT sitemap advertises multiple pages cheaply.
 // Does NOT change the mechanism: the 404 boundary is the MAIN query's
 // posts_per_page (blog page count), not this value.
 add_filter( 'wp_sitemaps_max_urls', static function () { return 50; } );

 // 2. Pretty permalinks + tiny blog page size, then flush rewrites.
 update_option( 'permalink_structure', '/%postname%/' );
 update_option( 'posts_per_page', 5 );
 global $wp_rewrite;
 $wp_rewrite->init();
 $wp_rewrite->set_permalink_structure( '/%postname%/' );
 $wp_rewrite->flush_rules( false );

 // 3. Seed: blog 6 posts / 5 per page = 2 main-query pages -> 404 boundary at page 3.
 //          CPT 120 posts / 50 max_urls = 3 CPT sitemap pages. Page 3 > boundary 2.
 $term    = wp_insert_term( 'Test Artist', 'repro_artist', array( 'slug' => 'test-artist' ) );
 $term_id = is_wp_error( $term )
         ? (int) get_term_by( 'slug', 'test-artist', 'repro_artist' )->term_id
         : (int) $term['term_id'];

 for ( $i = 1; $i <= 6; $i++ ) {
         wp_insert_post( array(
                 'post_type'   => 'post',
                 'post_status' => 'publish',
                 'post_title'  => "Blog Post $i",
                 'post_name'   => "blog-post-$i",
         ) );
 }
 for ( $i = 1; $i <= 120; $i++ ) {
         $pid = wp_insert_post( array(
                 'post_type'   => 'repro_event',
                 'post_status' => 'publish',
                 'post_title'  => "Repro Event $i",
                 'post_name'   => "repro-event-$i",
         ) );
         if ( $pid && ! is_wp_error( $pid ) ) {
                 wp_set_object_terms( $pid, array( $term_id ), 'repro_artist' );
         }
 }

 $ppp                   = (int) get_option( 'posts_per_page' );
 $blog_main_query_pages = (int) ceil( (int) wp_count_posts( 'post' )->publish / $ppp );

 // 4. Drive the real core request lifecycle for each sitemap sub-page URL.
 $probe = static function ( $query_vars_string ) {
         global $wp, $wp_query, $wp_the_query;

         $captured = array( 'status' => null );
         $cb = static function ( $header ) use ( &$captured ) {
                 if ( preg_match( '#\s(\d{3})\s#', ' ' . $header . ' ', $m ) ) {
                         $captured['status'] = (int) $m[1];
                 }
                 return $header;
         };
         add_filter( 'status_header', $cb, 10, 1 );

         $wp                  = new WP();
         $wp_query            = new WP_Query();
         $wp_the_query        = $wp_query;
         $GLOBALS['wp_query'] = $wp_query;

         parse_str( $query_vars_string, $qv );
         $wp->query_vars = $qv;
         $wp->query_posts();   // class-wp.php:824
         $wp->handle_404();    // class-wp.php:825 — sets the status

         $is_404 = $wp_query->is_404();

         // What WP_Sitemaps would actually render for this page.
         $server    = wp_sitemaps_get_server();
         $sitemap   = $qv['sitemap'] ?? '';
         $subtype   = $qv['sitemap-subtype'] ?? '';
         $paged     = isset( $qv['paged'] ) ? (int) $qv['paged'] : 1;
         $url_count = null;
         if ( $sitemap && 'index' !== $sitemap ) {
                 $provider = $server->registry->get_provider( $sitemap );
                 if ( $provider ) {
                         $url_list  = $provider->get_url_list( $paged ?: 1, $subtype );
                         $url_count = is_array( $url_list ) ? count( $url_list ) : 0;
                 }
         }
         remove_filter( 'status_header', $cb, 10 );

         return array(
                 'query'                      => $query_vars_string,
                 'main_query_post_type'       => $wp_query->get( 'post_type' ),
                 'main_query_posts'           => count( $wp_query->posts ),
                 'handle_404_set_is_404'      => $is_404,
                 'status_header'              => $captured['status'],
                 'sitemap_renderer_url_count' => $url_count,
         );
 };

 $probes = array(
         'cpt_page_1'    => $probe( 'sitemap=posts&sitemap-subtype=repro_event&paged=1' ),
         'cpt_page_3'    => $probe( 'sitemap=posts&sitemap-subtype=repro_event&paged=3' ),
         'artist_page_3' => $probe( 'sitemap=taxonomies&sitemap-subtype=repro_artist&paged=3' ),
 );

 $bug_reproduced = ( true === $probes['cpt_page_3']['handle_404_set_is_404'] )
         && ( $probes['cpt_page_3']['sitemap_renderer_url_count'] > 0 );

 echo wp_json_encode( array(
         'wp_version'            => get_bloginfo( 'version' ),
         'blog_main_query_pages' => $blog_main_query_pages,
         'probes'               => $probes,
         'bug_reproduced'       => $bug_reproduced,
 ), JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES ) . "\n";
 }}}

 == Expected vs actual ==

 '''Expected:''' `200 OK` with the page's `<urlset>` (the renderer produces
 valid URLs for that page).

 '''Actual:''' `404 Not Found`, but the response body is a complete, valid
 `<urlset>`. The status header and the body disagree.

 Captured output (clean WP 7.0, no plugins):

 || '''request''' || '''paged''' || '''main query post_type''' || '''main query posts''' || '''status''' || '''renderer URL count''' ||
 || wp-sitemap-posts-repro_event-1.xml || 1 || "" (defaults to post) || 5 || '''200''' || 50 ||
 || wp-sitemap-posts-repro_event-3.xml || 3 || "" (defaults to post) || 0 || '''404''' || '''20''' ||
 || wp-sitemap-taxonomies-repro_artist-3.xml || 3 || "" (defaults to post) || 0 || '''404''' || valid ||

 Blog main-query pages = 2 (7 posts / 5 per page); CPT sitemap pages = 3.
 The 404 appears at page 3 because 3 > 2 — i.e. the boundary is the
 '''blog''' page count, irrespective of the requested subtype.
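
 The boundary arithmetic can be sketched in plain PHP (no WordPress
 required). This is an illustration only: `repro_affected_sitemap_pages()`
 is a hypothetical helper, not a core function, and it assumes page 1 is
 never affected because an unpaged request does not reach the paged 404
 path.

 {{{#!php
 <?php
 /**
  * Hypothetical helper (not part of core): lists the sub-sitemap pages the
  * index advertises but the dummy main query would flag as 404.
  */
 function repro_affected_sitemap_pages( int $blog_posts, int $posts_per_page, int $subtype_objects, int $max_urls ): array {
     // Pages the main `post` query can fill. Assumption: page 1 is always safe.
     $blog_pages = max( 1, (int) ceil( $blog_posts / $posts_per_page ) );
     // Pages the sitemap index advertises for the requested subtype.
     $sitemap_pages = (int) ceil( $subtype_objects / $max_urls );

     return $sitemap_pages > $blog_pages
         ? range( $blog_pages + 1, $sitemap_pages )
         : array();
 }

 // Numbers from the captured run: 7 posts at 5 per page, 120 CPT posts at
 // 50 URLs per sitemap page.
 print_r( repro_affected_sitemap_pages( 7, 5, 120, 50 ) );
 }}}

 With the captured numbers the only affected page is 3. Raising the blog
 post count until the main query has at least 3 pages makes the list empty,
 which is why the bug is easy to miss on sites with many regular posts.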

 == Root cause ==

 The core sitemap rewrite rule maps a paginated sitemap URL to query vars
 '''without a `post_type`''':

 {{{
 ^wp-sitemap-([a-z]+?)-([a-z\d_-]+?)-(\d+?)\.xml$
   => index.php?sitemap=$1&sitemap-subtype=$2&paged=$3
 }}}

 So `wp-sitemap-posts-repro_event-3.xml` becomes
 `index.php?sitemap=posts&sitemap-subtype=repro_event&paged=3`. There is no
 `post_type` query var, so the '''dummy main `WP_Query`''' built for the
 request defaults to the `post` post type at the site's `posts_per_page`.
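
 The mapping can be checked outside WordPress with the pattern quoted
 above. A minimal sketch: `repro_sitemap_query_vars()` is a hypothetical
 name, and the real rewrite engine does more than a single `preg_match()`.

 {{{#!php
 <?php
 /**
  * Hypothetical helper: applies the sitemap rewrite pattern quoted in this
  * ticket to a request path and returns the resulting query vars.
  */
 function repro_sitemap_query_vars( string $path ): ?array {
     $pattern = '#^wp-sitemap-([a-z]+?)-([a-z\d_-]+?)-(\d+?)\.xml$#';
     if ( ! preg_match( $pattern, $path, $m ) ) {
         return null; // Not a paginated sub-sitemap URL.
     }
     return array(
         'sitemap'         => $m[1],
         'sitemap-subtype' => $m[2],
         'paged'           => $m[3],
     );
 }

 var_export( repro_sitemap_query_vars( 'wp-sitemap-posts-repro_event-3.xml' ) );
 }}}

 The result carries `sitemap`, `sitemap-subtype` and `paged`, and no
 `post_type` key, which is what lets the main query fall back to `post`.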

 `WP::main()` (source:tags/7.0/src/wp-includes/class-wp.php#L818) runs, in
 order:

  * `query_posts()` — source:tags/7.0/src/wp-includes/class-wp.php#L824 —
 runs the dummy main query (post type = post)
  * `handle_404()` — source:tags/7.0/src/wp-includes/class-wp.php#L825 —
 '''decides the status BEFORE the renderer runs'''
  * `register_globals()` — L826
  * `send_headers()` — L829
  * `do_action( 'wp' )` — L838

 In `WP::handle_404()` (source:tags/7.0/src/wp-includes/class-wp.php#L724),
 for `paged=3` the main query returns zero `post`s:

  * L754 `elseif ( $wp_query->posts )`: false (no posts).
  * L783 `elseif ( ! is_paged() )`: false, because `paged=3` makes the
 request paged, so the exemptions at L789
 (`is_tag()`/`is_category()`/`is_tax()`/`is_post_type_archive()` +
 `get_queried_object()`, `is_home()`, `is_search()`, `is_feed()`) are never
 evaluated.
  * `$set_404` stays true -> L797–L801: `$wp_query->set_404();
 status_header( 404 ); nocache_headers();`.
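
 The branches above can be condensed into a small decision model. This is
 a simplified sketch for illustration, not the core source: the function
 name and its three boolean inputs are assumptions, and the admin, robots
 and favicon exemptions are left out.

 {{{#!php
 <?php
 /**
  * Simplified model of the WP::handle_404() branches described in this
  * ticket. Returns true when the request would be flagged as a 404.
  */
 function repro_model_sets_404( bool $has_posts, bool $is_paged, bool $exemption_matches ): bool {
     if ( $has_posts ) {
         return false; // The main query found posts.
     }
     if ( ! $is_paged && $exemption_matches ) {
         return false; // Unpaged empty archives, home, search and feeds.
     }
     return true; // Empty main query and no exemption applies.
 }

 // cpt_page_1: the dummy main query finds 5 blog posts.
 var_export( repro_model_sets_404( true, false, false ) );
 echo "\n";
 // cpt_page_3: the dummy main query is empty and the request is paged.
 var_export( repro_model_sets_404( false, true, false ) );
 echo "\n";
 }}}

 Nothing in the model looks at the `sitemap` query vars, which mirrors the
 point of this ticket: the decision depends only on the unrelated `post`
 query.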

 Then, later in the request, `WP_Sitemaps::render_sitemaps()`
 (source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L163),
 hooked on '''`template_redirect`''' (registered in `WP_Sitemaps::init()`
 at source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L69),
 runs '''after''' `WP::main()` has already returned and already set the
 404. It queries its '''own''' provider (`$provider->get_url_list( $paged,
 $object_subtype )`, L208), gets a valid non-empty URL list, and renders it
 (L217 `render_sitemap()`, L218 `exit`). It only sets its own 404 when the
 page is genuinely empty (L211–L214) — and it '''never resets''' the status
 header back to `200` for a populated page.

 Net: `handle_404()` (main query, defaults to `post`) and
 `render_sitemaps()` (provider query, correct subtype) disagree about
 whether the page exists, and `handle_404()` wins the status header because
 it runs first and the renderer never corrects it.

 == Why there is no existing guard ==

 `WP_Sitemaps::redirect_sitemapxml()` was hooked on `pre_handle_404`, but
 it only performed a `wp_safe_redirect()` for the legacy
 `pagename=sitemap-xml` permalink case; it never short-circuited 404
 handling for paginated sub-sitemap requests. That method was
 '''deprecated in 6.7.0'''
 (source:tags/7.0/src/wp-includes/sitemaps/class-wp-sitemaps.php#L231,
 `_deprecated_function( __FUNCTION__, '6.7.0' )`), and no `pre_handle_404`
 guard exists for sitemap routes. The main-query 404 path has therefore
 been unguarded for sitemap routes since they shipped in 5.5.

 == Proposed fix ==

 Have `WP_Sitemaps` short-circuit 404 handling for its own routes. The
 renderer already owns the legitimate empty-page 404 (L211–L214), so the
 main-query check adds nothing for these requests:

 {{{#!php
 <?php
 // In WP_Sitemaps::init(), alongside the template_redirect registration:
 add_filter( 'pre_handle_404', array( $this, 'pre_handle_404' ), 10, 2 );

 /**
  * Short-circuit core 404 handling for sitemap routes so the dummy main
  * query (which defaults to the `post` post type) cannot 404 a sitemap
  * response. WP_Sitemaps::render_sitemaps() remains responsible for the
  * response and still issues its own 404 when a page genuinely has no URLs.
  */
 public function pre_handle_404( $bypass, $wp_query ) {
     if ( $bypass ) {
         return $bypass;
     }
     if ( $wp_query->get( 'sitemap' ) || $wp_query->get( 'sitemap-stylesheet' ) ) {
         return true;
     }
     return $bypass;
 }
 }}}

 Alternatively, exempt sitemap query vars inside `WP::handle_404()` itself,
 but the `pre_handle_404` approach keeps the sitemap-specific knowledge in
 `WP_Sitemaps`.

 == Affected versions (verified) ==

 Not a regression — present since core XML sitemaps shipped in 5.5. The
 same repro script was run unchanged across multiple major versions in
 clean Playground instances; the paginated sub-page is flagged `is_404`
 (and served 404) in every one:

 || '''WordPress''' || '''cpt_page_3 handle_404 set is_404''' || '''renderer URL count''' ||
 || 5.5.18 || true || 20 ||
 || 6.6.5 || true || 20 ||
 || 6.7.5 || true || 20 ||
 || 7.0 || true || 20 ||

 (The deprecation of `WP_Sitemaps::redirect_sitemapxml()` in 6.7.0 is
 unrelated: that method only performed a `wp_safe_redirect()` for the
 legacy `pagename=sitemap-xml` permalink case and never guarded paginated
 sub-sitemap requests against the main-query 404. The bug reproduces
 identically on 6.6, before that deprecation.)

 == Related ==

 Surfaced while investigating a related virtual-route status bug in feeds
 (`get_feed_build_date()` / `WP_Query` `fields => ids`, see
 [https://github.com/WordPress/wordpress-develop/pull/11387 wordpress-develop#11387],
 where the same "is the virtual route masking a lower-level main-query
 issue?" question applies). This ticket is the sitemap-route analogue: a
 virtual route whose status is decided by the unrelated dummy main query.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/65375>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform

