[wp-trac] [WordPress Trac] #60841: Create token lookup class.
WordPress Trac
noreply at wordpress.org
Mon Mar 25 17:13:32 UTC 2024
#60841: Create token lookup class.
-----------------------------+-----------------------------
Reporter: dmsnell | Owner: (none)
Type: feature request | Status: new
Priority: normal | Milestone: Awaiting Review
Component: General | Version: trunk
Severity: normal | Keywords: has-patch
Focuses: |
-----------------------------+-----------------------------
In the HTML API it would be nice to perform spec-compliant attribute
encoding and decoding. This requires a divergence from current Core
behavior; namely, to rely on the set of defined named character references
rather than a list that has been combined over time.
Specifically, when parsing HTML character references I keep finding the
need for two operations:
- is this substring an HTML character reference?
- what, if any, is the next character reference in this string at this
offset?
For the first question it's possible to use `in_array( $substring,
$character_reference_names, true )`, even though this may not be optimal.
It's a reasonable first step.
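As a rough illustration of that first step, the check might look like the sketch below. The short list of names and the helper `is_character_reference_name()` are made up for the example; the real list would be the full set of names from the HTML5 table.

{{{#!php
<?php
// Hypothetical, abbreviated list for illustration only.
// The real set is the full HTML5 named character reference table.
$character_reference_names = array( 'amp;', 'lt;', 'gt;', 'quot;', 'not', 'notin;' );

/**
 * Hypothetical helper: reports whether $substring is exactly one of the names.
 * The third argument to in_array() forces strict (===) comparison,
 * so no loose type juggling can produce a false positive.
 */
function is_character_reference_name( string $substring, array $names ): bool {
	return in_array( $substring, $names, true );
}

var_dump( is_character_reference_name( 'amp;', $character_reference_names ) );
var_dump( is_character_reference_name( 'ampersand;', $character_reference_names ) );
}}}

Note that `in_array()` scans the list linearly, which is part of why this may not be optimal for a set with a couple of thousand entries.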
For the second question I'm unaware of a strong implementation available
in PHP or WordPress. The change motivating this is a move away from
//splitting a string into potential character references and
non-character-reference contents and then checking whether each one is
valid//, and towards a single-pass parser that is less likely to be
confused by strings that are almost character references. Handling
character references may also get faster, but that is not a driving goal.
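{{{#!php
<?php
// $was_at marks where the not-yet-copied plain text starts; $at is the scan position.
$name = $entity_set->read_token( $html, $at );
if ( false !== $name ) {
	// Copy the plain text before the reference, then the decoded reference.
	$decoded .= substr( $html, $was_at, $at - $was_at ) .
		entity_lookup( 'attribute', $name );
	$at    += strlen( $name );
	$was_at = $at; // The next plain-text span starts after the reference.
} else {
	++$at;
}
}}}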
https://github.com/WordPress/wordpress-develop/pull/5337/files#diff-e2bc0d3d983191acdb2effe67311dc37666eae53d59983281b34a7b4eed238acR1124-R1163
A final need involves mapping a token to a value, but this may be best
relegated to another class or application code. For example, mapping from
named character reference to UTF-8 bytes representing the corresponding
Code Point.
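To make that mapping concrete, here is a minimal sketch that keeps it in application code as a plain associative array. The function name `sketch_reference_to_utf8()` and the three-entry map are assumptions for the example and are not part of the proposal.

{{{#!php
<?php
/**
 * Hypothetical helper: maps a named character reference (without the
 * leading "&") to the UTF-8 bytes of its Code Point.
 * Returns null when the name is not in the (abbreviated) map.
 */
function sketch_reference_to_utf8( string $name ): ?string {
	// Abbreviated map for illustration; "\u{...}" yields UTF-8 bytes in PHP 7+.
	static $map = array(
		'amp;'   => '&',
		'lt;'    => '<',
		'notin;' => "\u{2209}", // U+2209 NOT AN ELEMENT OF
	);

	return $map[ $name ] ?? null;
}

echo bin2hex( sketch_reference_to_utf8( 'notin;' ) ), "\n";
}}}

Keeping the values separate from the set means the set only has to answer "which token is here?", and each caller can choose its own replacement table.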
----
My proposal is adding a new class `WP_Token_Set` providing at least two
methods for normal use:
- `contains( $token )` returns whether the passed string is in the set.
- `read_token( $text, $offset = 0 )` indicates if the character sequence
starting at the given offset in the passed string forms a token in the
set, and if so, returns the longest matching sequence.
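The two methods above can be sketched as follows. This is a deliberately naive stand-in, not the implementation in the linked PR: the class name `Sketch_Token_Set` is hypothetical, and it finds the longest match by trying progressively shorter substrings against a hash map.

{{{#!php
<?php
final class Sketch_Token_Set {
	/** @var array<string, true> Tokens stored as array keys for fast lookup. */
	private $tokens = array();

	/** @var int Byte length of the longest stored token. */
	private $max_length = 0;

	public function __construct( array $words ) {
		foreach ( $words as $word ) {
			$this->tokens[ $word ] = true;
			$this->max_length      = max( $this->max_length, strlen( $word ) );
		}
	}

	/** Returns whether the passed string is in the set. */
	public function contains( string $token ): bool {
		return isset( $this->tokens[ $token ] );
	}

	/**
	 * Returns the longest token starting at $offset, or false if none matches.
	 * Candidates are tried from longest to shortest, so the first hit wins.
	 */
	public function read_token( string $text, int $offset = 0 ) {
		$available = strlen( $text ) - $offset;

		for ( $length = min( $this->max_length, $available ); $length > 0; $length-- ) {
			$candidate = substr( $text, $offset, $length );
			if ( isset( $this->tokens[ $candidate ] ) ) {
				return $candidate;
			}
		}

		return false;
	}
}

$set = new Sketch_Token_Set( array( 'not', 'notin;', 'amp;' ) );
var_dump( $set->read_token( '&notin; x', 1 ) ); // Longest match beats the shorter "not".
var_dump( $set->read_token( '&notit', 1 ) );    // Falls back to the shorter token.
}}}

The "longest match" rule matters because some names are prefixes of others, as with `not` and `notin;` here.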
The class also provides utility methods for precomputing these sets. They
are designed for relatively static cases where the actual code is
generated programmatically but then stays static over time. For example,
HTML5 defines the set of named character references and indicates that the
list //shall not// change or be expanded.
[https://html.spec.whatwg.org/#named-character-references-table HTML5
spec]
- `static::from_array( array $words )` generates a new token set from the
given array of tokens.
- `to_array()` dumps the set of tokens into an array of string tokens.
- `static::from_precomputed_table( $table )` instantiates a token set
from a precomputed table, skipping the computation for building the table
and sorting the tokens.
- `precomputed_php_source_table()` generates PHP source code which can be
loaded with the previous static method for maintenance of the core static
token sets.
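A round trip through these four methods might look like the sketch below. Everything about the table layout is an assumption made for the example (tokens grouped by first byte, longest first), as is the class name `Sketch_Precomputed_Set`; the table format in the linked PR may be entirely different.

{{{#!php
<?php
final class Sketch_Precomputed_Set {
	/** @var array<string, string[]> Hypothetical layout: first byte => tokens, longest first. */
	private $table = array();

	/** Builds the table from scratch: dedupe, group by first byte, sort each group. */
	public static function from_array( array $words ): self {
		$table = array();
		foreach ( array_unique( $words ) as $word ) {
			if ( '' !== $word ) {
				$table[ $word[0] ][] = $word;
			}
		}

		ksort( $table, SORT_STRING );
		foreach ( $table as &$group ) {
			usort(
				$group,
				static function ( $a, $b ) {
					// Longest first, then bytewise order for a stable result.
					return strlen( $b ) <=> strlen( $a ) ?: strcmp( $a, $b );
				}
			);
		}
		unset( $group );

		return self::from_precomputed_table( $table );
	}

	/** Trusts that $table is already grouped and sorted, so no work is repeated. */
	public static function from_precomputed_table( array $table ): self {
		$set        = new self();
		$set->table = $table;
		return $set;
	}

	/** Flattens the table back into a list of tokens. */
	public function to_array(): array {
		$tokens = array();
		foreach ( $this->table as $group ) {
			foreach ( $group as $token ) {
				$tokens[] = $token;
			}
		}
		return $tokens;
	}

	/** Returns a PHP array expression that from_precomputed_table() can load. */
	public function precomputed_php_source_table(): string {
		return var_export( $this->table, true );
	}
}

$set = Sketch_Precomputed_Set::from_array( array( 'not', 'amp;', 'notin;' ) );
echo $set->precomputed_php_source_table(), "\n";
}}}

In this sketch the generated source would be pasted into a file once, and later loads would skip the grouping and sorting done in `from_array()`.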
----
Having a semantic class for this work provides an opportunity to optimize
lookup without demanding that the user-space (or Core-space) code change.
There are more methods that could be useful but which aren't included in
the proposal because they haven't been necessary:
- `add( $token )` and `remove( $token )` for dynamically altering the
table.
Also, this is currently limited to storing tokens with a byte length <= 256
for practical implementation reasons (it has not been necessary to store
longer tokens).
## How can I provide feedback?
I'm happy to do all the work of documenting the class, adding tests, and
cleaning up the code for inclusion. What I'd appreciate from you is
feedback on the idea, on the naming of the class and the methods, on
whether you have considered similar ideas before, and on the general
approach taken in the linked PRs.
Thank you all.
--
Ticket URL: <https://core.trac.wordpress.org/ticket/60841>
WordPress Trac <https://core.trac.wordpress.org/>
WordPress publishing platform