[wp-hackers] Importing HTML files as pages -- been done?

Mon Feb 9 00:07:29 GMT 2009

Mike Schinkel wrote:
> "Dougal Campbell" <dougal at gunters.org> wrote:
>   
>> No, a DOM-based approach is definitely better than regex. 
>> Regexes for parsing HTML can get *extremely* complicated, 
>> and if you start trying to write a regex-based parser from 
>> scratch, you'll almost certainly miss some things. 
>>     
>
> I agree, in general.  In her specific case she said that she'd have enclosing <div>s with unique IDs identifying the content to select. That <div> would be easy to find even with strpos(), and from there a simple loop to find the matching closing </div> would work.  Yes, there are potential issues with that approach, but they would be rare.  For a general-purpose tool those limitations wouldn't be acceptable, but for a quick & dirty tool to accomplish a specific conversion it would be sufficient and easy.
>   
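
(For anyone following along, the scan Mike describes might be sketched roughly as below. It is written in Python rather than PHP purely for illustration, with str.find() standing in for strpos(). The function name extract_div and the assumption that the attribute is written exactly as id="..." with double quotes are mine, not anything from the thread.)

```python
def extract_div(html, div_id):
    """Return the inner HTML of the <div> carrying the given id, or None."""
    # Assumption: the attribute appears literally as id="..." (double quotes).
    attr_pos = html.find('id="%s"' % div_id)
    if attr_pos == -1:
        return None

    # Back up to the "<" that opens the tag and check that it is a <div>.
    open_start = html.rfind("<", 0, attr_pos)
    if open_start == -1 or not html.startswith("<div", open_start):
        return None

    # End of the opening tag; the content starts right after it.
    open_end = html.find(">", attr_pos)
    if open_end == -1:
        return None

    # Walk forward, counting nested <div> ... </div> pairs until the
    # div we started in is closed.
    depth = 1
    pos = open_end + 1
    while depth > 0:
        next_open = html.find("<div", pos)
        next_close = html.find("</div>", pos)
        if next_close == -1:
            return None  # unbalanced markup, give up
        if next_open != -1 and next_open < next_close:
            depth += 1
            pos = next_open + len("<div")
        else:
            depth -= 1
            if depth == 0:
                return html[open_end + 1:next_close]
            pos = next_close + len("</div>")
    return None


page = ('<html><body><div id="nav">menu</div>'
        '<div id="content"><p>Hi</p><div class="x">inner</div><p>Bye</p></div>'
        '</body></html>')
content = extract_div(page, "content")
```

The "potential issues" show up quickly in a sketch like this: it is case-sensitive, it would be fooled by a <div> inside a comment or script, and it does not cope with single-quoted ids or a closing tag written as "</div >".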

and

Mike Little wrote:
> If you have the fortune to only need to parse machine generated XHTML,
> it may be worth having a look at my DITA importer I just released (
> http://zed1.com/journalized/wordpress-plugins/dita-to-wordpress-import-tool/
> )
>
> I just used the plain old PHP5 DOM manipulation classes to do the work.
> There are examples of finding, removing, modifying, and adding in
> elements using straight DOM and also XPath.
> The code is far from elegant, but it might give someone a start.
>
>   

True, if you know for sure that you have well-formed input, then it 
greatly simplifies things. But it seemed that the discussion was 
wandering towards the idea of a more generalized tool that could deal 
with arbitrary sets of files. And since I've been dealing with such a 
case for some time now, it was sort of weighing down my brain in that 
direction anyways. :)
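
To make the well-formed case concrete, here is a minimal sketch of the DOM-plus-XPath style of extraction. It is in Python (xml.etree.ElementTree from the standard library) rather than the PHP5 DOM classes Mike used, and it is not taken from his importer. The sample document, the "nav" class, and the function name extract_paragraphs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed XHTML normally declares this default namespace, so the
# path expressions below have to use a prefix bound to it.
NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical input, standing in for one machine-generated page.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Demo</title></head>
<body>
<div id="header"><p>Site name</p></div>
<div id="content"><p class="nav">Home</p><p>First <em>real</em> paragraph.</p><p>Second.</p></div>
</body>
</html>"""


def extract_paragraphs(xhtml, div_id):
    """Return the text of each <p> directly inside the div with this id."""
    root = ET.fromstring(xhtml)  # raises ET.ParseError if not well-formed

    # Find: locate the div by its id attribute.
    div = root.find(".//x:div[@id='%s']" % div_id, NS)
    if div is None:
        return None

    # Remove: drop navigation paragraphs we do not want to import.
    for p in div.findall("x:p[@class='nav']", NS):
        div.remove(p)

    # Modify: mark the div as processed (purely illustrative).
    div.set("class", "imported")

    # Collect the text of the remaining paragraphs, flattening inline tags.
    return ["".join(p.itertext()) for p in div.findall("x:p", NS)]


paragraphs = extract_paragraphs(SAMPLE, "content")
```

The catch is the first line of the function: hand it tag soup and the parser raises ET.ParseError instead of guessing, which is exactly why arbitrary sets of files are the harder problem.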

-- 
Dougal Campbell <dougal at gunters.org>
http://dougal.gunters.org/
http://twitter.com/dougal