[wp-trac] [WordPress Trac] #63020: HTML API: Breadcrumbs should include element indices and attributes

Wed Feb 26 19:48:17 UTC 2025

#63020: HTML API: Breadcrumbs should include element indices and attributes
-------------------------+---------------------
 Reporter:  westonruter  |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  6.9
Component:  HTML API     |     Version:
 Severity:  normal       |  Resolution:
 Keywords:  needs-patch  |     Focuses:
-------------------------+---------------------
Description changed by westonruter:

Old description:

> The [https://wordpress.org/plugins/optimization-detective/ Optimization
> Detective] plugin from the Core Performance Team extends
> `WP_HTML_Tag_Processor` with some features from `WP_HTML_Processor` like
> `get_breadcrumbs()` and `get_current_depth()`. It also introduces its own
> method `get_xpath()` which computes an XPath expression to uniquely
> identify the element, for example:
>
> {{{
> /HTML/BODY/DIV[@class='wp-site-
> blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]
> }}}
>
> See [https://github.com/WordPress/performance/blob/trunk/plugins
> /optimization-
> detective/docs/introduction.md#:~:text=The%20format%20of%20the%20XPath%20expression%20warrants%20further%20discussion.
> full documentation] for why the XPath is constructed like this. In short,
> `/HTML` and `/HTML/BODY` lack any node indices since there is no
> possibility for ambiguity. For children of the `BODY`, using node indices
> is not stable since arbitrary HTML may be printed at `wp_body_open()`,
> and for this reason it uses the `id`, `role`, or `class` attribute to add
> a disambiguating XPath predicate. For levels below this, elements are
> referenced as `*[1][self::IMG]` to target the an element that occurs at a
> specific position. If this were instead `/HEADER[1]` it would select the
> first `IMG` among other `IMG` elements, not the first `IMG` among all
> siblings. This ensures the XPath only matches an `IMG` when it is the
> first child, and it will no longer match if a `P` is inserted before it,
> for example.
>
> All this to say, `WP_HTML_Processor` does not keep track of element node
> indices, and it doesn't expose the attributes for the tags in the open
> stack (e.g. to get the `id`, `role`, or `class`). This would seem to make
> it more difficult to implement `get_xpath()` than maybe it should be.
> Ideally computing the XPath wouldn't require subclassing at all, and the
> information could be obtained from existing public methods. In
> Optimization Detective, the `WP_HTML_Tag_Processor` class is extended and
> the `next_token()` method is overridden so it can construct its own
> breadcrumbs and then also compute the node indices and capture the
> attributes at a given depth.
>
> All this to say, I suggest that in addition to `get_breadcrumbs()` that
> there be a way to get more information from the open stack of tags,
> including the attributes for each tag and the node index for each.

New description:

 The [https://wordpress.org/plugins/optimization-detective/ Optimization
 Detective] plugin from the Core Performance Team extends
 `WP_HTML_Tag_Processor` with some features from `WP_HTML_Processor` like
 `get_breadcrumbs()` and `get_current_depth()`. It also introduces its own
 method `get_xpath()` which computes an XPath expression to uniquely
 identify the element, for example:

 {{{
 /HTML/BODY/DIV[@class='wp-site-
 blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]
 }}}

 See [https://github.com/WordPress/performance/blob/trunk/plugins
 /optimization-
 detective/docs/introduction.md#:~:text=The%20format%20of%20the%20XPath%20expression%20warrants%20further%20discussion.
 full documentation] for why the XPath is constructed like this. In short,
 `/HTML` and `/HTML/BODY` lack any node indices since there is no
 possibility for ambiguity. For children of the `BODY`, using node indices
 is not stable since arbitrary HTML may be printed at `wp_body_open()`, and
 for this reason it uses the `id`, `role`, or `class` attribute to add a
 disambiguating XPath predicate. For levels below this, elements are
 referenced as `*[1][self::IMG]` to target an element that occurs at a
 specific position. If this were instead `IMG[1]` it would select the
 first `IMG` among other `IMG` elements, not the first `IMG` among all
 siblings. The positional form ensures the XPath only matches an `IMG` when it is the
 first child, and it will no longer match if a `P` is inserted before it,
 for example.
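
 The difference between the two forms can be checked with PHP's built-in
 `DOMXPath`. This is only a minimal sketch: the two XML fragments and the
 `count_matches()` helper are made up for illustration and are not part of
 Optimization Detective or the HTML API.

 {{{#!php
 <?php
 // Hypothetical helper: count the nodes an XPath expression selects in a fragment.
 function count_matches( string $xml, string $expression ): int {
         $document = new DOMDocument();
         $document->loadXML( $xml );
         $xpath = new DOMXPath( $document );
         return $xpath->query( $expression )->length;
 }

 // Illustrative fragments: the second one has a P inserted before the IMG.
 $before = '<HEADER><IMG/><P/></HEADER>';
 $after  = '<HEADER><P/><IMG/></HEADER>';

 // Matches only when the IMG is the first child among all siblings.
 $positional = '/HEADER/*[1][self::IMG]';
 // Matches the first IMG among IMG siblings, wherever it sits.
 $by_name = '/HEADER/IMG[1]';

 $results = array(
         'positional_before' => count_matches( $before, $positional ), // 1
         'positional_after'  => count_matches( $after, $positional ),  // 0
         'by_name_before'    => count_matches( $before, $by_name ),    // 1
         'by_name_after'     => count_matches( $after, $by_name ),     // 1
 );
 }}}

 Only the positional expression stops matching once the `P` is inserted,
 which is the behavior described above.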

 All this to say, `WP_HTML_Processor` does not keep track of element node
 indices, and it doesn't expose the attributes for the tags in the open
 stack (e.g. to get the `id`, `role`, or `class`). This makes `get_xpath()`
 harder to implement than it should be.
 Ideally computing the XPath wouldn't require subclassing at all, and the
 information could be obtained from existing public methods. In
 Optimization Detective, the `WP_HTML_Tag_Processor` class is extended and
 the `next_token()` method is overridden so it can construct its own
 breadcrumbs and then also compute the node indices and capture the
 attributes at a given depth.
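
 To illustrate the kind of bookkeeping involved, here is a rough,
 standalone sketch. It is not the Optimization Detective implementation and
 does not use the HTML API: the `Open_Element_Tracker` class and its
 methods are hypothetical, and the caller is assumed to report each opening
 and closing tag (including an immediate close for void elements).

 {{{#!php
 <?php
 // Hypothetical tracker: keeps the open stack with a node index and attributes per element.
 final class Open_Element_Tracker {
         /** @var array[] Open elements, each with 'tag', 'index', and 'attributes'. */
         private $stack = array();

         /** @var int[] Number of element children seen so far at each depth. */
         private $child_counts = array( 0 );

         public function open( string $tag, array $attributes = array() ): void {
                 $depth = count( $this->stack );
                 $this->child_counts[ $depth ]++;
                 $this->stack[] = array(
                         'tag'        => strtoupper( $tag ),
                         'index'      => $this->child_counts[ $depth ],
                         'attributes' => $attributes,
                 );
                 // A newly opened element has no children yet.
                 $this->child_counts[ $depth + 1 ] = 0;
         }

         public function close(): void {
                 array_pop( $this->stack );
         }

         public function get_stack(): array {
                 return $this->stack;
         }
 }

 $tracker = new Open_Element_Tracker();
 $tracker->open( 'html' );
 $tracker->open( 'body' );
 $tracker->open( 'div', array( 'class' => 'wp-site-blocks' ) );
 $tracker->open( 'header' );
 $tracker->open( 'div' );
 $tracker->open( 'span' );
 $tracker->close(); // The SPAN is closed, so the next sibling gets index 2.
 $tracker->open( 'img' );
 $stack = $tracker->get_stack();
 }}}

 At this point the last entry on the stack is the `IMG` with node index 2,
 and the `DIV` below `BODY` still carries its `class` attribute.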

 I therefore suggest that, in addition to `get_breadcrumbs()`, there be a
 way to get more information from the open stack of tags, including the
 attributes and the node index for each tag.

 In other words, it's currently possible to construct an XPath like
 `/HTML/BODY` as follows:

 {{{#!php
 <?php
 // Prefix each open tag name with "/" and join the steps into one expression.
 $xpath = implode(
         '',
         array_map(
                 function ( string $breadcrumb ): string {
                         return "/$breadcrumb";
                 },
                 $processor->get_breadcrumbs()
         )
 );
 }}}

 But I'm proposing something like `get_element_breadcrumbs()` which would
 return objects for each open tag on the stack instead of just the tag
 name. So then you could construct a full unambiguous XPath:

 {{{#!php
 <?php
 $xpath = implode(
         '',
         array_map(
                 function ( WP_Element $breadcrumb ): string {
                         // Match any element, then narrow it by tag name.
                         $expression  = '/*';
                         $expression .= sprintf( '[self::%s]', $breadcrumb->get_tag() );
                         // Add a predicate for the first of these attributes that is present.
                         foreach ( array( 'id', 'role', 'class' ) as $attribute_name ) {
                                 $attribute = $breadcrumb->get_attribute( $attribute_name );
                                 if ( is_string( $attribute ) ) {
                                         $expression .= sprintf(
                                                 '[@%s="%s"]',
                                                 $attribute_name,
                                                 addcslashes( $attribute, '\\"' )
                                         );
                                         break;
                                 }
                         }
                         return $expression;
                 },
                 $processor->get_element_breadcrumbs()
         )
 );
 }}}
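
 If each element on the stack also carried its node index, the XPath format
 described earlier could be assembled along these lines. This is only a
 sketch under assumptions: plain arrays stand in for the proposed element
 objects, the `build_xpath()` function and the `tag`, `index`, and
 `attributes` keys are hypothetical, and attribute values containing quote
 characters are not handled.

 {{{#!php
 <?php
 // Hypothetical builder following the rules described in this ticket.
 function build_xpath( array $elements ): string {
         $xpath = '';
         foreach ( array_values( $elements ) as $depth => $element ) {
                 if ( $depth < 2 ) {
                         // HTML and BODY: no index needed since they are unambiguous.
                         $xpath .= '/' . $element['tag'];
                 } elseif ( 2 === $depth ) {
                         // Children of BODY: disambiguate by attribute rather than by index.
                         $xpath .= '/' . $element['tag'];
                         foreach ( array( 'id', 'role', 'class' ) as $name ) {
                                 if ( isset( $element['attributes'][ $name ] ) ) {
                                         $xpath .= sprintf( "[@%s='%s']", $name, $element['attributes'][ $name ] );
                                         break;
                                 }
                         }
                 } else {
                         // Deeper levels: position among all siblings, then the tag name.
                         $xpath .= sprintf( '/*[%d][self::%s]', $element['index'], $element['tag'] );
                 }
         }
         return $xpath;
 }

 // Illustrative stack data matching the example expression in the description.
 $elements = array(
         array( 'tag' => 'HTML', 'index' => 1, 'attributes' => array() ),
         array( 'tag' => 'BODY', 'index' => 1, 'attributes' => array() ),
         array( 'tag' => 'DIV', 'index' => 1, 'attributes' => array( 'class' => 'wp-site-blocks' ) ),
         array( 'tag' => 'HEADER', 'index' => 1, 'attributes' => array() ),
         array( 'tag' => 'DIV', 'index' => 1, 'attributes' => array() ),
         array( 'tag' => 'IMG', 'index' => 2, 'attributes' => array() ),
 );
 $built = build_xpath( $elements );
 }}}

 For this data, `$built` is the same expression shown at the top of the
 description.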

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63020#comment:3>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform