[wp-trac] [WordPress Trac] #62269: WP_HTML_Processor::next_token() cannot be extended in subclasses to keep track of state

Mon Oct 21 23:28:51 UTC 2024

#62269: WP_HTML_Processor::next_token() cannot be extended in subclasses to keep
track of state
--------------------------+---------------------
 Reporter:  westonruter   |       Owner:  (none)
     Type:  defect (bug)  |      Status:  new
 Priority:  normal        |   Milestone:  6.7
Component:  HTML API      |     Version:  6.5
 Severity:  normal        |  Resolution:
 Keywords:                |     Focuses:
--------------------------+---------------------
Changes (by westonruter):

 * keywords:  has-patch has-unit-tests =>

Old description:

> In the Optimization Detective plugin from the WordPress Performance team
> there is a need to compute a precise locator for each tag in a document
> beyond just what `get_breadcrumbs()` provides. In particular, there is a
> need to disambiguate between two tags which may be siblings of each other
> in which case `array( 'html', 'body', 'img' )` will be ambiguous.
> Currently we're using XPaths for this purpose, for example if there are
> three `IMG` tags appearing as siblings at the beginning of the `BODY`,
> their XPaths are computed as:
>
> * `/*[1][self::HTML]/*[2][self::BODY]/*[1][self::IMG]`
> * `/*[1][self::HTML]/*[2][self::BODY]/*[2][self::IMG]`
> * `/*[1][self::HTML]/*[2][self::BODY]/*[3][self::IMG]`
>
> In order to compute these XPaths with HTML Tag Processor, the plugin
> extends the `WP_HTML_Tag_Processor` class with an wrapped version of
> `next_token()` so it can keep track of each new tag encountered to build
> up the array structure to compute the XPath.
>
> This turns out not to work when extending `WP_HTML_Processor` because
> `WP_HTML_Processor::next_token()` often does recursive calls, resulting
> in erroneous XPath indices being computed. For example, `next_token()` is
> called twice when processing `<html>` and three times when processing
> `<body>`, at least in my sample doc.
>
> The fix seems simple: move the logic from
> `WP_HTML_Processor::next_token()` into another private method like
> `WP_HTML_Processor::_next_token()` and update any recursive references to
> also call `WP_HTML_Processor::_next_token()`. Then
> `WP_HTML_Processor::next_token()` can simply just call
> `WP_HTML_Processor::_next_token()` and extending classes will be able to
> rely on each invocation of `next_token` corresponding to a new token.
> This would also be similar to what `WP_HTML_Tag_Processor::next_token()`
> does in that it is simply wrapping a call to
> `WP_HTML_Tag_Processor::base_class_next_token()`.

New description:

 In the Optimization Detective plugin from the WordPress Core Performance
 Team there is a need to compute a precise locator for each tag in a
 document beyond just what `get_breadcrumbs()` provides. In particular,
 there is a need to disambiguate between two tags which may be siblings of
 each other in which case `array( 'html', 'body', 'img' )` will be
 ambiguous. Currently we're using XPaths for this purpose, for example if
 there are three `IMG` tags appearing as siblings at the beginning of the
 `BODY`, their XPaths are computed as:

 * `/*[1][self::HTML]/*[2][self::BODY]/*[1][self::IMG]`
 * `/*[1][self::HTML]/*[2][self::BODY]/*[2][self::IMG]`
 * `/*[1][self::HTML]/*[2][self::BODY]/*[3][self::IMG]`

 In order to compute these XPaths with HTML Tag Processor, the plugin
 extends the `WP_HTML_Tag_Processor` class with an wrapped version of
 `next_token()` so it can keep track of each new tag encountered to build
 up the array structure to compute the XPath.

 This turns out not to work when extending `WP_HTML_Processor` because
 `WP_HTML_Processor::next_token()` often does recursive calls, resulting in
 erroneous XPath indices being computed. For example, `next_token()` is
 called twice when processing `<html>` and three times when processing
 `<body>`, at least in my sample doc.

 The fix seems simple: move the logic from
 `WP_HTML_Processor::next_token()` into another private method like
 `WP_HTML_Processor::_next_token()` and update any recursive references to
 also call `WP_HTML_Processor::_next_token()`. Then
 `WP_HTML_Processor::next_token()` can simply just call
 `WP_HTML_Processor::_next_token()` and extending classes will be able to
 rely on each invocation of `next_token` corresponding to a new token. This
 would also be similar to what `WP_HTML_Tag_Processor::next_token()` does
 in that it is simply wrapping a call to
 `WP_HTML_Tag_Processor::base_class_next_token()`.

--

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62269#comment:2>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform