[wp-trac] [WordPress Trac] #63020: HTML API: Breadcrumbs should include element indices and attributes

Wed Feb 26 19:39:34 UTC 2025

#63020: HTML API: Breadcrumbs should include element indices and attributes
-------------------------+---------------------
 Reporter:  westonruter  |       Owner:  (none)
     Type:  enhancement  |      Status:  new
 Priority:  normal       |   Milestone:  6.9
Component:  HTML API     |     Version:
 Severity:  normal       |  Resolution:
 Keywords:  needs-patch  |     Focuses:
-------------------------+---------------------
Description changed by westonruter:

Old description:

> The [https://wordpress.org/plugins/optimization-detective/ Optimization
> Detective] plugin from the Core Performance Team extends
> `WP_HTML_Tag_Processor` with some features from `WP_HTML_Processor` like
> `get_breadcrumbs()` and `get_current_depth()`. It also introduces its own
> method `get_xpath()` which computes an XPath expression to uniquely
> identify the element, for example:
>
> {{{
> /HTML/BODY/DIV[@class='wp-site-blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]
> }}}
>
> See [https://github.com/WordPress/performance/blob/trunk/plugins/optimization-detective/docs/introduction.md#:~:text=The%20format%20of%20the%20XPath%20expression%20warrants%20further%20discussion. full documentation]
> for why the XPath is constructed like this. In short,
> `/HTML` and `/HTML/BODY` lack any node indices since there is no
> possibility for ambiguity. For children of the `BODY`, using node indices
> is not stable since arbitrary HTML may be printed at `wp_body_open()`,
> and for this reason it uses the `id`, `role`, or `class` attribute to add
> a disambiguating XPath predicate. For levels below this, elements are
> referenced as `*[1][self::IMG]` to target an element that occurs at a
> specific position. If this were instead `/IMG[1]` it would select the
> first `IMG` among other `IMG` elements, not the first `IMG` among all
> siblings. This ensures the XPath only matches an `IMG` when it is the
> first child, and it will no longer match if a `P` is inserted before it,
> for example.
>
> All this to say, `WP_HTML_Processor` does not keep track of element node
> indices, and it doesn't expose the attributes for the tags in the open
> stack (e.g. to get the `id`, `role`, or `class`). This would seem to make
> it more difficult to implement `get_xpath()` than maybe it should be.
> Ideally computing the XPath wouldn't require subclassing at all, and the
> information could be obtained from existing public methods. In
> Optimization Detective, the `WP_HTML_Tag_Processor` class is extended and
> the `next_token()` method is overridden so it can construct its own
> breadcrumbs and then also compute the node indices and capture the
> attributes at a given depth.

New description:

 The [https://wordpress.org/plugins/optimization-detective/ Optimization
 Detective] plugin from the Core Performance Team extends
 `WP_HTML_Tag_Processor` with some features from `WP_HTML_Processor` like
 `get_breadcrumbs()` and `get_current_depth()`. It also introduces its own
 method `get_xpath()` which computes an XPath expression to uniquely
 identify the element, for example:

 {{{
 /HTML/BODY/DIV[@class='wp-site-blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]
 }}}

 See [https://github.com/WordPress/performance/blob/trunk/plugins/optimization-detective/docs/introduction.md#:~:text=The%20format%20of%20the%20XPath%20expression%20warrants%20further%20discussion. full documentation]
 for why the XPath is constructed like this. In short,
 `/HTML` and `/HTML/BODY` lack any node indices since there is no
 possibility for ambiguity. For children of the `BODY`, using node indices
 is not stable since arbitrary HTML may be printed at `wp_body_open()`, and
 for this reason it uses the `id`, `role`, or `class` attribute to add a
 disambiguating XPath predicate. For levels below this, elements are
 referenced as `*[1][self::IMG]` to target an element that occurs at a
 specific position. If this were instead `/IMG[1]` it would select the
 first `IMG` among other `IMG` elements, not the first `IMG` among all
 siblings. This ensures the XPath only matches an `IMG` when it is the
 first child, and it will no longer match if a `P` is inserted before it,
 for example.
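
 The construction rules above can be sketched in a few lines. This is only
 an illustration in Python, not the plugin's PHP implementation: the
 `OpenElement` type, the `build_xpath()` function, and the priority order
 `id`, then `role`, then `class` are assumptions made for the example, and
 attribute values are not escaped.

```python
from typing import Dict, List, NamedTuple


class OpenElement(NamedTuple):
    """Hypothetical record for one entry in the stack of open elements."""
    tag: str                    # upper-case tag name, e.g. "DIV"
    index: int                  # 1-based position among all element siblings
    attributes: Dict[str, str]  # attributes captured when the tag was opened


def build_xpath(open_elements: List[OpenElement]) -> str:
    """Build an XPath following the rules described in the ticket."""
    segments = []
    for depth, element in enumerate(open_elements):
        if depth < 2:
            # /HTML and /HTML/BODY are unambiguous, so no index is needed.
            segments.append(f"/{element.tag}")
        elif depth == 2:
            # Children of BODY: indices are unstable, so disambiguate with
            # the first available attribute (assumed priority order).
            predicate = ""
            for name in ("id", "role", "class"):
                value = element.attributes.get(name)
                if value:
                    predicate = f"[@{name}='{value}']"
                    break
            segments.append(f"/{element.tag}{predicate}")
        else:
            # Deeper levels: position among all siblings, then a tag check.
            segments.append(f"/*[{element.index}][self::{element.tag}]")
    return "".join(segments)


example_stack = [
    OpenElement("HTML", 1, {}),
    OpenElement("BODY", 2, {}),
    OpenElement("DIV", 1, {"class": "wp-site-blocks"}),
    OpenElement("HEADER", 1, {}),
    OpenElement("DIV", 1, {}),
    OpenElement("IMG", 2, {}),
]
xpath = build_xpath(example_stack)
```

 With this example stack, `xpath` is the same expression as the one shown
 at the top of this description.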

 All this to say, `WP_HTML_Processor` does not keep track of element node
 indices, and it doesn't expose the attributes for the tags in the open
 stack (e.g. to get the `id`, `role`, or `class`). This would seem to make
 it more difficult to implement `get_xpath()` than maybe it should be.
 Ideally computing the XPath wouldn't require subclassing at all, and the
 information could be obtained from existing public methods. In
 Optimization Detective, the `WP_HTML_Tag_Processor` class is extended and
 the `next_token()` method is overridden so it can construct its own
 breadcrumbs and then also compute the node indices and capture the
 attributes at a given depth.
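
 As a rough illustration of that bookkeeping, the sketch below walks a
 simplified token stream and records, for every element, its tag name,
 its index among siblings, and its attributes. It is written in Python for
 brevity and is not the plugin's code: the `walk()` function and the
 `(kind, tag, attributes)` token shape are hypothetical, and real HTML
 parsing (implied end tags, text nodes, and so on) is ignored.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Token = Tuple[str, str, Dict[str, str]]  # (kind, tag, attributes)
Entry = Tuple[str, int, Dict[str, str]]  # (tag, 1-based index, attributes)


def walk(tokens: Iterable[Token]) -> Iterator[List[Entry]]:
    """Yield the stack of open elements each time an element is reached.

    kind is "open", "close", or "void" (an element with no closing tag).
    """
    stack: List[Entry] = []
    # child_counts[-1] is the number of element children seen so far
    # under the innermost open element (or under the document itself).
    child_counts = [0]
    for kind, tag, attributes in tokens:
        if kind == "close":
            stack.pop()
            child_counts.pop()
            continue
        child_counts[-1] += 1
        entry = (tag, child_counts[-1], attributes)
        if kind == "open":
            stack.append(entry)
            child_counts.append(0)  # start counting this element's children
            yield list(stack)
        else:
            # Void elements never stay on the stack.
            yield stack + [entry]


tokens = [
    ("open", "HTML", {}),
    ("open", "HEAD", {}),
    ("close", "HEAD", {}),
    ("open", "BODY", {}),
    ("open", "DIV", {"class": "wp-site-blocks"}),
    ("open", "HEADER", {}),
    ("open", "DIV", {}),
    ("open", "SPAN", {}),
    ("close", "SPAN", {}),
    ("void", "IMG", {"src": "logo.png"}),
]
snapshots = list(walk(tokens))
img_path = [(tag, index) for tag, index, _ in snapshots[-1]]
```

 Here `img_path` holds the tag name and sibling index for each ancestor of
 the `IMG` plus the `IMG` itself, which is the information that
 `get_breadcrumbs()` alone does not provide.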

 I therefore suggest that, in addition to `get_breadcrumbs()`, there be a
 way to get more information from the open stack of tags, including the
 attributes and the node index for each tag.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/63020#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform