[wp-trac] [WordPress Trac] #60698: Add optimized set lookup class.
WordPress Trac
noreply at wordpress.org
Wed Mar 6 01:18:34 UTC 2024
#60698: Add optimized set lookup class.
-----------------------------+-----------------------------
 Reporter:  dmsnell          |      Owner:  (none)
     Type:  feature request  |     Status:  new
 Priority:  normal           |  Milestone:  Awaiting Review
Component:  General          |    Version:  trunk
 Severity:  normal           |   Keywords:
  Focuses:                   |
-----------------------------+-----------------------------
In the course of exploratory development in the HTML API there have been a
few times where I wanted to test if a given string is in a set of
statically-known strings, and a few times where I wanted to check if the
next span of text represents an item in the set.
For the first case, `in_array()` is a suitable method, but isn't always
ideal when the test set is large.
{{{#!php
<?php
if ( in_array( '&notin', $html5_named_character_references, true ) ) {
	…
}
}}}
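One common way to avoid that linear scan, sketched here only as an illustration and not as part of this proposal, is to key an array by the terms so that membership becomes a hash lookup through `isset()`. The short list and the `$is_named_character_reference` name below are stand-ins chosen for the example.
{{{#!php
<?php
// Stand-in for the full set of names; only a few entries are shown.
$html5_named_character_references = array( '&notin', '&notin;', '&amp', '&amp;' );

// Build the lookup once: each term becomes a key, so isset() performs a
// hash lookup instead of the linear scan that in_array() performs.
$is_named_character_reference = array_fill_keys( $html5_named_character_references, true );

var_dump( isset( $is_named_character_reference['&notin'] ) ); // bool(true)
var_dump( isset( $is_named_character_reference['&nope'] ) );  // bool(false)
}}}
This only answers the exact-membership question. It does not help with finding a term at an offset inside a longer string.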
For the second case, `in_array()` isn't adequate, and a more complicated
lookup is necessary.
{{{#!php
<?php
foreach ( $html5_named_character_references as $name ) {
	// Compare only strlen( $name ) bytes so that a prefix match at $at counts.
	if ( 0 === substr_compare( $html, $name, $at, /* length */ strlen( $name ), /* case insensitive */ true ) ) {
		…
		return $name;
	}
}
}}}
This second example demonstrates some catastrophic lookup characteristics
when it's not certain if the following input is any token from the set,
let alone which one it might be. The at-hand code has to iterate the
search domain and then compare every single possibility against the input
string, bailing when one is found.
While reviewing code in various places I've noticed a similar pattern and
need, mostly being served by `in_array()` and a regex that splits apart an
input string into a large array, allocating substrings for each array
element, and then calling `in_array()` inside the regex callback (or when
the results array is passed to another function). This is all memory-heavy
and inefficient at runtime.
----
I'd like to propose a new class representing a relatively static set of
terms or tokens. It would provide a test for membership in the set, and a
way to read the next matching term or token at a given offset in a string,
if the characters at that offset match one.
{{{#!php
<?php
$named_character_references = new WP_Token_Set( [ '&notin', '&notin;', '&amp', … ] );
if ( $named_character_references->contains( '&notin' ) ) {
	…
}
$output = '';
$at     = 0;
$was_at = 0; // Start of the text not yet copied into $output.
while ( true ) {
	$at = strpos( $text, '&', $at );
	if ( false === $at ) {
		// No more candidates: copy the rest of the input unchanged.
		$output .= substr( $text, $was_at );
		break;
	}
	$name = $named_character_references->read_token( $text, $at );
	if ( false !== $name ) {
		// Copy the text before the reference, then its replacement.
		$output .= substr( $text, $was_at, $at - $was_at );
		$output .= $named_character_replacements[ $name ];
		$at     += strlen( $name );
		$was_at  = $at;
		continue;
	}
	// No named character reference was found, continue searching.
	++$at;
}
}}}
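To make the two operations concrete, here is a minimal sketch of one way such a set could work. It is not the class from the linked pull request: `Sketch_Token_Set` is a hypothetical name, and grouping tokens by byte length with longest-first matching is an assumption made for the example.
{{{#!php
<?php
/**
 * Hypothetical illustration only; not the proposed WP_Token_Set.
 */
final class Sketch_Token_Set {
	/** @var array Tokens grouped by byte length, longest group first. */
	private $by_length = array();

	public function __construct( array $tokens ) {
		foreach ( $tokens as $token ) {
			$this->by_length[ strlen( $token ) ][ $token ] = true;
		}
		// Try longer tokens first so that "&notin;" wins over "&not".
		krsort( $this->by_length );
	}

	public function contains( string $token ): bool {
		return isset( $this->by_length[ strlen( $token ) ][ $token ] );
	}

	/** @return string|false Longest token starting at $offset, or false if none. */
	public function read_token( string $text, int $offset = 0 ) {
		$remaining = strlen( $text ) - $offset;
		foreach ( $this->by_length as $length => $tokens ) {
			if ( $length > $remaining ) {
				continue; // Not enough input left for a token this long.
			}
			$candidate = substr( $text, $offset, $length );
			if ( isset( $tokens[ $candidate ] ) ) {
				return $candidate;
			}
		}
		return false;
	}
}

$set = new Sketch_Token_Set( array( '&not', '&notin;', '&amp;' ) );
var_dump( $set->contains( '&amp;' ) );             // bool(true)
var_dump( $set->read_token( 'a &notin; b', 2 ) );  // string(7) "&notin;"
var_dump( $set->read_token( 'a &nothing b', 2 ) ); // string(4) "&not"
var_dump( $set->read_token( 'a & b', 2 ) );        // bool(false)
}}}
Each lookup costs one hash probe per distinct token length rather than one comparison per token. The sketch still allocates a substring per probe, so it falls short of the low-allocation goal described in this ticket.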
----
Further, because WordPress mostly deals with large and relatively static
token sets (named character references, allowable URL schemes, file types,
loaded templates, etc…), it would be nice to be able to precompute the
lookup tables if they are at all costly to build, since rebuilding them on
every PHP load is unnecessarily burdensome.
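As a rough sketch of what precomputation could look like, a build step might write the lookup table out as PHP source with `var_export()` and have the runtime `require` the generated file. The table layout and the file path here are assumptions for the example, not something taken from the linked pull requests.
{{{#!php
<?php
// Build step: run once, for example whenever the token list changes.
$tokens = array( '&not', '&notin;', '&amp;' );
$table  = array();
foreach ( $tokens as $token ) {
	$table[ strlen( $token ) ][ $token ] = true;
}
krsort( $table );

// Hypothetical location for the generated file.
$cache_file = sys_get_temp_dir() . '/token-set-lookup.php';
file_put_contents( $cache_file, '<?php return ' . var_export( $table, true ) . ';' );

// Runtime: load the ready-made table instead of rebuilding it.
$lookup = require $cache_file;
var_dump( isset( $lookup[7]['&notin;'] ) ); // bool(true)
}}}
A generated file that only returns a literal array is also a good fit for OPcache, which can keep the compiled array in shared memory across requests.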
A bonus feature would be a method to add and a method to remove terms.
----
In [https://github.com/WordPress/wordpress-develop/pull/5373 #5373] I have
proposed such a `WP_Token_Set` and used it in
[https://github.com/WordPress/wordpress-develop/pull/5337 #5337] to create
a spec-compliant, low-memory-overhead, and efficient replacement for
`esc_attr()`.
The replacement `esc_attr()` is able to more reliably parse attribute
values than the existing code and it does so more efficiently, avoiding
numerous memory allocations and lookups.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60698>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform