[wp-hackers] WP issues

Geoffrey Sneddon foolistbar at googlemail.com
Mon Jun 4 13:48:57 GMT 2007


On 3 Jun 2007, at 16:06, Sam Angove wrote:

> On 6/3/07, Geoffrey Sneddon <foolistbar at googlemail.com> wrote:
>>
>> Parse it as HTML, then serialise it. You'll have well-formed and
>> valid output. Trying to parse HTML using regular expressions will
>> never work ...
>
> A serialiser approaches irrelevancy if we have a sufficiently good
> HTML parser. The single most important and problematic part of the
> system is the part that handles unexpected input, which is to say,
> errors. That part is presumably in the parser.

How do you propose to get from the parser output to XHTML, if not
with a serialiser?

> If we had such a parser, well-formedness would be a trivial problem.
> That has *nothing* to do with a serialiser: if we can guarantee good
> input, good output is easy.

The rules for HTML and XHTML differ (though content served as
text/html is handled by browsers following neither specification),
so you must convert between the two, which is harder than it sounds.
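
To make the parse-then-serialise idea concrete, here is a rough sketch
in Python. It is only an illustration under stated assumptions:
Python's standard html.parser stands in for a real HTML parser (it is
not an HTML5-conformant one), and the names XhtmlSerialiser and
to_xhtml are made up for this example. It shows a few of the
differences the serialiser has to paper over: void elements,
valueless attributes, unescaped ampersands and unclosed elements.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# HTML void elements: they never take an end tag in HTML, so they are
# written as empty-element tags in the XHTML output.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}


class XhtmlSerialiser(HTMLParser):
    """Hypothetical helper: turn tag-soup events into balanced markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of serialised output
        self.open_tags = []  # stack of elements still open

    def handle_starttag(self, tag, attrs):
        seen = {}
        for name, value in attrs:
            # XML forbids duplicate and valueless attributes: keep the
            # first occurrence, and expand "disabled" to disabled="disabled".
            seen.setdefault(name, name if value is None else value)
        attr_text = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # stray end tag: drop it
        # Close anything left open inside the element being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Character references were already decoded, so re-escape.
        self.out.append(escape(data))


def to_xhtml(html):
    serialiser = XhtmlSerialiser()
    serialiser.feed(html)
    serialiser.close()
    # Close whatever the input left open at the end.
    while serialiser.open_tags:
        serialiser.out.append(f"</{serialiser.open_tags.pop()}>")
    return "".join(serialiser.out)


print(to_xhtml("<p>a <b>bold<br>text</p>"))
```

This sketch drops comments and doctypes, does not check that attribute
names are legal XML names, and knows nothing about HTML's implied end
tags or about validity, which is part of why the conversion is harder
than it sounds.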

> But we have not had such a parser, and no-one has ever offered one. If
> you have one, that's a game-changer. I would very much like the
> technical details, or better yet the code.

There's an SGML parser in PEAR. It isn't overly relevant in the real
world, as no browser uses an SGML parser, but it would be better than
what we currently have:
http://pear.php.net/package/XML_HTMLSax3

And there are several HTML5 parsers under development (HTML5 aims to
be compatible with classic HTML parsers and with current web content):
http://php-html5lib.dashslot.net/
http://jero.net/lab/ph5p/

>> I was advocating either ensuring that the output is valid (as any
>> XHTML parser can throw a fatal error on invalid XHTML (and yes, I do
>> mean invalid, not malformed)) or not using something where fatal
>> errors can be thrown (such as HTML).
>
> I agree in theory, but there are practical considerations. If you have
> a workable solution to the XHTML validity problem, hooray; I'll wait
> for the details. (This comes right back to the HTML parser.) If not,
> there are good reasons for maintaining the status quo.

What about the second part of that comment: using HTML, which gets
around the well-formedness constraint of XHTML?
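
As a small illustration of that difference, here is a sketch in
Python, assuming its standard library parsers as stand-ins for a
browser's XML and HTML parsers. The helper names xml_accepts and
html_tags are made up for this example. It covers only
well-formedness, not validity.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def xml_accepts(markup):
    """Return True if an XML parser accepts the markup, False on a fatal error."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag seen; an HTML parser never aborts on tag soup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


soup = "<p>one<br>two</p>"
print(xml_accepts(soup))                   # XML parser: fatal error
print(xml_accepts("<p>one<br />two</p>"))  # well-formed, so accepted
print(html_tags(soup))                     # HTML parser: carries on regardless
```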


- Geoffrey Sneddon