[wp-trac] [WordPress Trac] #60698: Add optimized set lookup class.

Wed Mar 6 01:18:34 UTC 2024

#60698: Add optimized set lookup class.
-----------------------------+-----------------------------
 Reporter:  dmsnell          |      Owner:  (none)
     Type:  feature request  |     Status:  new
 Priority:  normal           |  Milestone:  Awaiting Review
Component:  General          |    Version:  trunk
 Severity:  normal           |   Keywords:
  Focuses:                   |
-----------------------------+-----------------------------
 In the course of exploratory development in the HTML API there have been a
 few times where I wanted to test if a given string is in a set of
 statically-known strings, and a few times where I wanted to check if the
 next span of text represents an item in the set.

 For the first case, `in_array()` works, but it scans the array linearly,
 so it isn't always ideal when the test set is large.

 {{{#!php
 <?php
 if ( in_array( '&notin', $html5_named_character_references, true ) ) {
         …
 }
 }}}
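
 One common way to avoid the linear scan, assuming the set is built once
 and queried many times, is to flip the list into array keys so that
 membership becomes a hash lookup. This is only a sketch: the
 `$reference_lookup` name and the three-item list are illustrative, and
 the approach only helps with exact-match tests, not with matching at an
 offset inside a longer string.

 {{{#!php
 <?php
 // Illustrative subset; the real list is far larger.
 $html5_named_character_references = array( '&notin', '&notin;', '&amp;' );

 // Build the keyed table once: each value becomes an array key.
 $reference_lookup = array_flip( $html5_named_character_references );

 // isset() on a key is a hash lookup rather than a scan of every element.
 $is_known   = isset( $reference_lookup['&notin'] );
 $is_unknown = isset( $reference_lookup['&bogus'] );
 }}}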

 For the second case, `in_array()` isn't adequate, and a more complicated
 lookup is necessary.

 {{{#!php
 <?php
 foreach ( $html5_named_character_references as $name ) {
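         // Limit the comparison to the length of the candidate name;
         // otherwise the rest of $html would have to equal $name exactly.
         if ( 0 === substr_compare( $html, $name, $at, strlen( $name ), /* case insensitive */ true ) ) {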
                 …
                 return $name;
         }
 }
 }}}

 This second example demonstrates some catastrophic lookup characteristics
 when it's not certain whether the following input is any token from the
 set, let alone which one it might be. The code at hand has to iterate over
 the entire search domain and compare every single candidate against the
 input string, bailing only when one matches.

 While reviewing code in various places I've noticed a similar pattern and
 need. It is mostly served by a regex that splits an input string into a
 large array, allocating a substring for each array element, followed by a
 call to `in_array()` inside the regex callback (or when the results array
 is passed to another function). This is all memory-heavy and inefficient
 at runtime.

 ----

 I'd like to propose a new class that represents a relatively static set of
 terms or tokens. It would provide a test for membership within the set
 and, given an offset in a string, report which term or token (if any)
 begins at that offset.

 {{{#!php
 <?php
 $named_character_references = new WP_Token_Set( [ '&notin', '&notin;',
 '&amp;', … ] );

 if ( $named_character_references->contains( '&notin' ) ) {
         …
 }

 while ( true ) {
         $was_at = $at;
         $at = strpos( $text, '&', $at );
         if ( false === $at ) {
                 $output .= substr( $text, $was_at );
                 break;
         }

         $name = $named_character_references->read_token( $text, $at );
         if ( false !== $name ) {
                 $output .= substr( $text, $was_at, $at - $was_at );
                 $output .= $named_character_replacements[ $name ];
                 $at     += strlen( $name );
                 continue;
         }

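         // No named character reference was found: copy the text up to and
         // including this ampersand, then continue searching.
         $output .= substr( $text, $was_at, $at + 1 - $was_at );
         ++$at;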
 }
 }}}
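
 To make the intended behavior concrete, here is a minimal sketch of one
 way such a class could work. It is not the implementation from the pull
 request linked in this ticket: the `Sketch_Token_Set` name, the two-byte
 grouping key, and the case-sensitive matching are all assumptions chosen
 for illustration. Tokens are grouped by their leading bytes so that a
 lookup only compares against candidates sharing that prefix, and longer
 tokens are tried first.

 {{{#!php
 <?php
 // Hypothetical sketch; names and grouping strategy are assumptions.
 class Sketch_Token_Set {
         const KEY_LENGTH = 2;

         // Tokens grouped by their first KEY_LENGTH bytes.
         private $groups = array();

         // Tokens too short to have a full grouping key.
         private $short_tokens = array();

         public function __construct( array $tokens ) {
                 foreach ( $tokens as $token ) {
                         if ( strlen( $token ) < self::KEY_LENGTH ) {
                                 $this->short_tokens[] = $token;
                         } else {
                                 $key                    = substr( $token, 0, self::KEY_LENGTH );
                                 $this->groups[ $key ][] = $token;
                         }
                 }

                 // Longest first, so '&notin;' is preferred over '&notin'.
                 $longest_first = static function ( $a, $b ) {
                         return strlen( $b ) - strlen( $a );
                 };
                 usort( $this->short_tokens, $longest_first );
                 foreach ( $this->groups as &$group ) {
                         usort( $group, $longest_first );
                 }
                 unset( $group );
         }

         // Exact membership test (case-sensitive).
         public function contains( $token ) {
                 return in_array( $token, $this->candidates( $token, 0 ), true );
         }

         // Returns the longest token starting at $offset, or false.
         public function read_token( $text, $offset = 0 ) {
                 if ( $offset < 0 || $offset >= strlen( $text ) ) {
                         return false;
                 }

                 foreach ( $this->candidates( $text, $offset ) as $candidate ) {
                         $length = strlen( $candidate );
                         if ( 0 === substr_compare( $text, $candidate, $offset, $length ) ) {
                                 return $candidate;
                         }
                 }

                 return false;
         }

         // Only tokens sharing the prefix at $offset, plus any short tokens.
         private function candidates( $text, $offset ) {
                 $key   = (string) substr( $text, $offset, self::KEY_LENGTH );
                 $group = isset( $this->groups[ $key ] ) ? $this->groups[ $key ] : array();

                 return array_merge( $group, $this->short_tokens );
         }
 }

 $set       = new Sketch_Token_Set( array( '&notin', '&notin;', '&amp;' ) );
 $with_semi = $set->read_token( 'a &notin; b', 2 ); // '&notin;'
 $no_semi   = $set->read_token( 'a &notin b', 2 );  // '&notin'
 $no_match  = $set->read_token( 'a &amp b', 2 );    // false
 }}}

 Grouping by a short prefix keeps the table a plain PHP array, which is
 simple to build and inspect. Other structures, such as a trie, would also
 fit the same two-method interface.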

 ----

 Further, because WordPress largely deals with large and relatively static
 token sets (named character references, allowable URL schemes, file types,
 loaded templates, etc…), it would be nice to be able to precompute the
 lookup tables if they are at all costly, as doing so on every PHP load is
 unnecessarily burdensome.
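
 As a rough illustration of precomputing, the grouped table could be
 generated once and written out as PHP source, so that loading it at
 runtime is a single `require` with no regrouping. The file name, the
 two-byte grouping, and the build step itself are assumptions made for
 this sketch, not part of the proposal.

 {{{#!php
 <?php
 // Hypothetical build step, run once rather than on every request.
 $tokens = array( '&notin', '&notin;', '&amp;' );

 $groups = array();
 foreach ( $tokens as $token ) {
         $groups[ substr( $token, 0, 2 ) ][] = $token;
 }

 // var_export() with a true second argument returns parsable PHP source.
 $source = "<?php\nreturn " . var_export( $groups, true ) . ";\n";

 // Illustrative location only.
 $path = sys_get_temp_dir() . '/sketch-token-groups.php';
 file_put_contents( $path, $source );

 // At runtime the precomputed table is loaded as-is.
 $loaded = require $path;
 }}}

 Because the generated file contains only a static array, an opcode cache
 such as OPcache should be able to keep the compiled table around between
 requests.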

 A bonus feature would be a method to add and a method to remove terms.

 ----

 In [https://github.com/WordPress/wordpress-develop/pull/5373 #5373] I have
 proposed such a `WP_Token_Set` and used it in
 [https://github.com/WordPress/wordpress-develop/pull/5337 #5337] to create
 a spec-compliant, low-memory-overhead, and efficient replacement for
 `esc_attr()`.

 The replacement `esc_attr()` is able to more reliably parse attribute
 values than the existing code and it does so more efficiently, avoiding
 numerous memory allocations and lookups.

-- 
Ticket URL: <https://core.trac.wordpress.org/ticket/60698>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform