[wp-trac] [WordPress Trac] #60841: Create token lookup class.

Mon Mar 25 17:13:32 UTC 2024

#60841: Create token lookup class.
-----------------------------+-----------------------------
 Reporter:  dmsnell          |      Owner:  (none)
     Type:  feature request  |     Status:  new
 Priority:  normal           |  Milestone:  Awaiting Review
Component:  General          |    Version:  trunk
 Severity:  normal           |   Keywords:  has-patch
  Focuses:                   |
-----------------------------+-----------------------------
 In the HTML API it would be nice to perform spec-compliant attribute
 encoding and decoding. This requires a divergence from current Core
 behavior; namely, to rely on the set of defined named character references
 rather than a list that has been combined over time.

 Specifically, when parsing HTML character references, I keep finding the
 need for two operations:
  - is this substring an HTML character reference?
  - what, if any, is the next character reference in this string at this
 offset?

 For the first question it's possible to use `in_array( $substring,
 $character_reference_names, true )`, even though this may not be optimal.
 It's a reasonable first step.
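
 As a rough illustration of that first step, here is a minimal sketch. The
 three-item `$character_reference_names` array and the helper
 `is_character_reference_name()` are hypothetical stand-ins, not Core code.

 {{{#!php
 <?php
 // Hypothetical sample; the real table has more than two thousand entries.
 $character_reference_names = array( 'amp;', 'lt;', 'gt;' );

 // Strict membership test: a linear scan over the list of names.
 function is_character_reference_name( $substring, array $names ) {
         return in_array( $substring, $names, true );
 }

 var_dump( is_character_reference_name( 'amp;', $character_reference_names ) ); // bool(true)
 var_dump( is_character_reference_name( 'amp', $character_reference_names ) );  // bool(false)
 }}}

 A keyed array checked with `isset()` is a common alternative that avoids
 the linear scan, at the cost of building the keyed array first.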

 For the second question I'm unaware of a strong implementation available
 in PHP or WordPress. The change motivating this is a move away from
 //splitting a string into potential character references and
 non-character-reference contents and then checking whether each one is
 valid// and towards a single-pass parser that is less likely to be
 confused by strings that are almost character references. Faster handling
 of character references is a potential side benefit, but it is not a
 driving goal.

 {{{#!php
 <?php
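 // Try to read a named character reference starting at byte offset $at.
 $name = $entity_set->read_token( $html, $at );
 if ( false !== $name ) {
         // Flush the plain text since the last match, then the decoded value.
         $decoded .= substr( $html, $was_at, $at - $was_at ) . entity_lookup( 'attribute', $name );
         $at      += strlen( $name );
         $was_at   = $at;
 } else {
         ++$at;
 }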
 }}}

 https://github.com/WordPress/wordpress-develop/pull/5337/files#diff-e2bc0d3d983191acdb2effe67311dc37666eae53d59983281b34a7b4eed238acR1124-R1163
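
 To show how a fragment like the one above might sit inside a complete
 single-pass loop, here is a self-contained sketch. Everything named
 `sample_*` or `SAMPLE_*` is hypothetical: the four-entry table stands in
 for the full set of references, `sample_read_token()` is a naive
 stand-in for `read_token()`, and storing the leading `&` as part of each
 token is a simplifying assumption. It does not apply the spec's
 attribute-specific rules for references that lack a semicolon.

 {{{#!php
 <?php
 // Hypothetical sample data; tokens include the leading "&" (an assumption).
 const SAMPLE_REFERENCES = array(
         '&amp;'   => '&',
         '&lt;'    => '<',
         '&not'    => "\u{AC}",
         '&notin;' => "\u{2209}",
 );

 // Naive stand-in for read_token(): longest sample token at $at, or false.
 function sample_read_token( $text, $at ) {
         $longest = false;
         foreach ( array_keys( SAMPLE_REFERENCES ) as $token ) {
                 $matches = substr( $text, $at, strlen( $token ) ) === $token;
                 if ( $matches && ( false === $longest || strlen( $token ) > strlen( $longest ) ) ) {
                         $longest = $token;
                 }
         }
         return $longest;
 }

 // Single pass over the input: copy plain text, replace recognized tokens.
 function sample_decode( $html ) {
         $decoded = '';
         $at      = 0;
         $was_at  = 0;
         $end     = strlen( $html );

         while ( $at < $end ) {
                 $name = sample_read_token( $html, $at );
                 if ( false !== $name ) {
                         $decoded .= substr( $html, $was_at, $at - $was_at ) . SAMPLE_REFERENCES[ $name ];
                         $at      += strlen( $name );
                         $was_at   = $at;
                 } else {
                         ++$at;
                 }
         }

         // Append whatever plain text follows the last match.
         return $decoded . substr( $html, $was_at );
 }

 echo sample_decode( 'a &lt; b &amp;c &bogus; &notin; x' ), "\n";
 }}}

 The almost-a-reference `&bogus;` passes through untouched, and `&notin;`
 wins over the shorter `&not` because the longest match is preferred.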

 A final need involves mapping a token to a value, but this may be best
 relegated to another class or application code. For example, mapping from
 named character reference to UTF-8 bytes representing the corresponding
 Code Point.

 ----

 My proposal is adding a new class `WP_Token_Set` providing at least two
 methods for normal use:

  - `contains( $token )` returns whether the passed string is in the set.
  - `read_token( $text, $offset = 0 )` checks whether the character sequence
 starting at the given offset in the passed string begins with a token in
 the set. If so, it returns the longest matching token; otherwise it
 returns `false`, as the snippet above checks for.
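
 One way these two methods could behave is sketched below. The class name
 `Sketch_Token_Set` and its internal layout (tokens grouped by first byte
 and sorted longest first) are assumptions made for illustration; this is
 not the implementation from the linked PR.

 {{{#!php
 <?php
 // Hypothetical sketch; not the class proposed in the PR.
 final class Sketch_Token_Set {
         // All tokens as array keys, for direct membership checks.
         private $lookup = array();

         // Tokens grouped by their first byte, longest first within each group.
         private $by_first_byte = array();

         public function __construct( array $tokens ) {
                 foreach ( $tokens as $token ) {
                         if ( '' === $token || isset( $this->lookup[ $token ] ) ) {
                                 continue;
                         }
                         $this->lookup[ $token ]             = true;
                         $this->by_first_byte[ $token[0] ][] = $token;
                 }
                 foreach ( $this->by_first_byte as &$group ) {
                         usort( $group, function ( $a, $b ) {
                                 return strlen( $b ) - strlen( $a );
                         } );
                 }
                 unset( $group );
         }

         public function contains( $token ) {
                 return isset( $this->lookup[ $token ] );
         }

         public function read_token( $text, $offset = 0 ) {
                 if ( $offset < 0 || $offset >= strlen( $text ) ) {
                         return false;
                 }
                 $group = $this->by_first_byte[ $text[ $offset ] ] ?? array();
                 // Groups are sorted longest first, so the first hit is the longest match.
                 foreach ( $group as $token ) {
                         if ( substr( $text, $offset, strlen( $token ) ) === $token ) {
                                 return $token;
                         }
                 }
                 return false;
         }
 }

 $set = new Sketch_Token_Set( array( 'not', 'notin;', 'amp;' ) );
 var_dump( $set->contains( 'amp;' ) );           // bool(true)
 var_dump( $set->read_token( '&notin; x', 1 ) ); // string(6) "notin;"
 var_dump( $set->read_token( '&nothing', 1 ) );  // string(3) "not"
 var_dump( $set->read_token( '&xyz', 1 ) );      // bool(false)
 }}}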

 It also provides utility methods for pre-computing these sets. They are
 designed for relatively static cases, where the source code for a set is
 generated by a script once and then stays unchanged over time. For
 example, HTML5 defines the set of named character references and indicates
 that the list //shall not// change or be expanded.
 [https://html.spec.whatwg.org/#named-character-references-table HTML5
 spec]

  - `static::from_array( array $words )` generates a new token set from the
 given array of tokens.
  - `to_array()` dumps the set of tokens into an array of string tokens.
  - `static::from_precomputed_table( $table )` instantiates a token set
 from a precomputed table, skipping the computation for building the table
 and sorting the tokens.
  - `precomputed_php_source_table()` generates PHP source code which can be
 loaded with the previous static method for maintenance of the core static
 token sets.
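
 The round trip these methods describe (build a table once, emit it as PHP
 source, load it later without rebuilding) can be sketched with plain
 functions. The `sketch_*` helpers and the table layout are hypothetical
 and only illustrate the idea; the PR's actual table format may differ.

 {{{#!php
 <?php
 // Hypothetical helpers; the table layout is an assumption for illustration.

 // Build the table once: tokens grouped by first byte, longest first.
 function sketch_build_table( array $words ) {
         $table = array();
         foreach ( array_unique( $words ) as $word ) {
                 if ( '' !== $word ) {
                         $table[ $word[0] ][] = $word;
                 }
         }
         foreach ( $table as &$group ) {
                 usort( $group, function ( $a, $b ) {
                         return strlen( $b ) - strlen( $a );
                 } );
         }
         unset( $group );
         ksort( $table );
         return $table;
 }

 // Emit PHP source for the table so it can be saved and loaded as-is.
 function sketch_table_source( array $table ) {
         return "<?php\nreturn " . var_export( $table, true ) . ";\n";
 }

 // Flatten a table back into a plain list of tokens.
 function sketch_table_to_array( array $table ) {
         return $table ? array_merge( ...array_values( $table ) ) : array();
 }

 $table = sketch_build_table( array( 'amp;', 'not', 'notin;', 'lt;' ) );

 // Write the generated source to a temporary file and load it back.
 $file = tempnam( sys_get_temp_dir(), 'tok' );
 file_put_contents( $file, sketch_table_source( $table ) );
 $reloaded = require $file;
 unlink( $file );

 var_dump( $reloaded === $table );                           // bool(true)
 echo implode( ' ', sketch_table_to_array( $reloaded ) ), "\n"; // amp; lt; notin; not
 }}}

 Loading the generated file skips the grouping and sorting work, which is
 the point of shipping a precomputed table for a set that never changes.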

 ----

 Having a semantic class for this work provides an opportunity to optimize
 lookup without demanding that the user-space (or Core-space) code change.
 There are more methods that could be useful but which aren't included in
 the proposal because they haven't been necessary:

  - `add( $token )` and `remove( $token )` for dynamically altering the
 table.

 Also, as a practical implementation detail, this is currently limited to
 storing tokens with a byte length <= 256 (it has not been necessary to
 store longer tokens).

 ## How can I provide feedback?

 I'm happy to do all the work of documenting the class, adding tests, and
 cleaning up the code for inclusion. What I'd appreciate from you is
 feedback on the idea, on the naming of the class and the methods, on
 whether you have considered similar ideas before, and on the general
 approach taken in the linked PRs.

 Thank you all.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60841>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform