[wp-trac] [WordPress Trac] #62269: WP_HTML_Processor::next_token() cannot be extended in subclasses to keep track of state

Mon Oct 21 23:21:59 UTC 2024

#62269: WP_HTML_Processor::next_token() cannot be extended in subclasses to keep
track of state
--------------------------+--------------------
 Reporter:  westonruter   |      Owner:  (none)
     Type:  defect (bug)  |     Status:  new
 Priority:  normal        |  Milestone:  6.7
Component:  HTML API      |    Version:  6.5
 Severity:  normal        |   Keywords:
  Focuses:                |
--------------------------+--------------------
 In the Optimization Detective plugin from the WordPress Performance team,
 there is a need to compute a precise locator for each tag in a document,
 beyond what `get_breadcrumbs()` provides. In particular, sibling tags with
 the same name need to be disambiguated, since they share the same
 breadcrumbs: `array( 'html', 'body', 'img' )` is ambiguous for them.
 Currently we're using XPaths for this purpose. For example, if there are
 three `IMG` tags appearing as siblings at the beginning of the `BODY`,
 their XPaths are computed as:

 * `/*[1][self::HTML]/*[2][self::BODY]/*[1][self::IMG]`
 * `/*[1][self::HTML]/*[2][self::BODY]/*[2][self::IMG]`
 * `/*[1][self::HTML]/*[2][self::BODY]/*[3][self::IMG]`

 In order to compute these XPaths with the HTML Tag Processor, the plugin
 extends the `WP_HTML_Tag_Processor` class with a wrapped version of
 `next_token()` so that it can keep track of each new tag encountered and
 build up the array structure needed to compute the XPath.
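
 As a rough illustration of the bookkeeping involved, here is a minimal
 sketch that derives such XPaths from a stream of open and close tag
 events. The `Toy_XPath_Tracker` class, its methods, and its short list of
 void tags are hypothetical names invented for this example; this is not
 the plugin's actual code, and it ignores the HTML parsing rules that the
 real processors handle.

```php
<?php
// Hypothetical helper for illustration; not part of WordPress or the plugin.
final class Toy_XPath_Tracker {
	// A small subset of void elements, which never stay open.
	const VOID_TAGS = array( 'IMG', 'BR', 'HR', 'INPUT', 'META', 'LINK' );

	/** @var array[] Open elements as array( tag name, sibling index ) pairs. */
	private $path = array();

	/** @var int[] Children seen so far at each depth; entry 0 is the document. */
	private $child_counts = array( 0 );

	/** Records an opening tag and returns its XPath. */
	public function open_tag( string $tag ): string {
		$tag   = strtoupper( $tag );
		$depth = count( $this->path );
		$index = ++$this->child_counts[ $depth ];

		// One step per open ancestor, then one step for the new tag itself.
		$xpath = '';
		foreach ( $this->path as list( $open_tag, $open_index ) ) {
			$xpath .= sprintf( '/*[%d][self::%s]', $open_index, $open_tag );
		}
		$xpath .= sprintf( '/*[%d][self::%s]', $index, $tag );

		// Non-void elements stay open and start a fresh child counter.
		if ( ! in_array( $tag, self::VOID_TAGS, true ) ) {
			$this->path[]                     = array( $tag, $index );
			$this->child_counts[ $depth + 1 ] = 0;
		}
		return $xpath;
	}

	/** Records the closing tag of the innermost open element. */
	public function close_tag(): void {
		if ( count( $this->path ) > 0 ) {
			array_pop( $this->path );
			array_pop( $this->child_counts );
		}
	}
}

// Usage: HTML > HEAD, then BODY with three sibling IMG tags.
$tracker = new Toy_XPath_Tracker();
$tracker->open_tag( 'HTML' );
$tracker->open_tag( 'HEAD' );
$tracker->close_tag();
$tracker->open_tag( 'BODY' );

$xpaths = array();
for ( $i = 0; $i < 3; $i++ ) {
	$xpaths[] = $tracker->open_tag( 'IMG' );
}
echo implode( "\n", $xpaths ), "\n";
```

 The sibling index is only correct if each tag is recorded exactly once,
 which is why the tracking depends on one `next_token()` call per token.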

 This turns out not to work when extending `WP_HTML_Processor` because
 `WP_HTML_Processor::next_token()` often calls itself recursively, so a
 subclass's override runs more than once for a single token, resulting in
 erroneous XPath indices being computed. For example, `next_token()` is
 called twice when processing `<html>` and three times when processing
 `<body>`, at least in my sample doc.
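
 The effect can be reproduced with a toy model. The sketch below is not
 the real `WP_HTML_Processor`: `Toy_Processor`, `Toy_Counting_Processor`,
 and the `'internal'` event are hypothetical stand-ins, and the event list
 is chosen only to mirror the call counts reported here. It shows how a
 recursive `$this->next_token()` call re-enters a subclass override.

```php
<?php
// Toy stand-in for a processor whose next_token() recurses via $this.
class Toy_Processor {
	protected $events;
	protected $current = null;

	public function __construct( array $events ) {
		$this->events = $events;
	}

	public function next_token(): bool {
		$event = array_shift( $this->events );
		if ( null === $event ) {
			return false; // End of input.
		}
		if ( 'internal' === $event ) {
			// Dispatched through $this, so a subclass override runs again.
			return $this->next_token();
		}
		$this->current = $event;
		return true;
	}

	public function get_token(): ?string {
		return $this->current;
	}
}

// A naive subclass that assumes one override call per token.
class Toy_Counting_Processor extends Toy_Processor {
	public $seen = 0;

	public function next_token(): bool {
		$found = parent::next_token();
		if ( $found ) {
			++$this->seen;
		}
		return $found;
	}
}

// One internal step before HTML and two before BODY (assumed for the demo).
$processor = new Toy_Counting_Processor(
	array( 'internal', 'HTML', 'internal', 'internal', 'BODY', 'IMG' )
);

$tokens = array();
while ( $processor->next_token() ) {
	$tokens[] = $processor->get_token();
}

printf( "tokens: %d, counted: %d\n", count( $tokens ), $processor->seen );
// prints "tokens: 3, counted: 6"
```

 Here `HTML` is counted twice and `BODY` three times, even though the
 caller only ever sees three tokens.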

 The fix seems simple: move the logic from
 `WP_HTML_Processor::next_token()` into a separate private method like
 `WP_HTML_Processor::_next_token()` and update any recursive references to
 call `WP_HTML_Processor::_next_token()` instead. Then
 `WP_HTML_Processor::next_token()` can simply call
 `WP_HTML_Processor::_next_token()`, and extending classes will be able to
 rely on each invocation of `next_token()` corresponding to a new token.
 This would also be similar to what `WP_HTML_Tag_Processor::next_token()`
 does, in that it simply wraps a call to
 `WP_HTML_Tag_Processor::base_class_next_token()`.
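
 A minimal sketch of that shape, using toy classes rather than the real
 WordPress ones: `Toy_Fixed_Processor`, `Toy_Fixed_Counting_Processor`,
 and the `'internal'` event are hypothetical, and `_next_token()` is only
 the name proposed here. The recursion stays inside the private method, so
 a subclass override of `next_token()` runs once per token.

```php
<?php
// Toy stand-in showing the proposed split; not the actual core patch.
class Toy_Fixed_Processor {
	protected $events;
	protected $current = null;

	public function __construct( array $events ) {
		$this->events = $events;
	}

	// Public entry point: a thin wrapper that subclasses can override.
	public function next_token(): bool {
		return $this->_next_token();
	}

	// All stepping logic, including recursion, lives in the private method.
	private function _next_token(): bool {
		$event = array_shift( $this->events );
		if ( null === $event ) {
			return false; // End of input.
		}
		if ( 'internal' === $event ) {
			// Recurses privately and never re-enters a subclass override.
			return $this->_next_token();
		}
		$this->current = $event;
		return true;
	}

	public function get_token(): ?string {
		return $this->current;
	}
}

// The same naive subclass now counts each token exactly once.
class Toy_Fixed_Counting_Processor extends Toy_Fixed_Processor {
	public $seen = 0;

	public function next_token(): bool {
		$found = parent::next_token();
		if ( $found ) {
			++$this->seen;
		}
		return $found;
	}
}

$fixed = new Toy_Fixed_Counting_Processor(
	array( 'internal', 'HTML', 'internal', 'internal', 'BODY', 'IMG' )
);

$fixed_tokens = array();
while ( $fixed->next_token() ) {
	$fixed_tokens[] = $fixed->get_token();
}

printf( "tokens: %d, counted: %d\n", count( $fixed_tokens ), $fixed->seen );
// prints "tokens: 3, counted: 3"
```

 Because the method is private, the recursive call always resolves to the
 base class implementation, whatever the subclass overrides.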

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/62269>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform