[wp-hackers] pulling a massive HTML site into Wordpress

Alex Andrews awgandrews at gmail.com
Mon Jun 6 10:30:36 UTC 2011


Yup, sadly there is no way of doing this without hacking some PHP. But
Baki's instructions are entirely correct, and you could do it as a
command line tool. I'm not sure about the last instruction, though: posts
and pages, as far as the database is concerned, are basically the same
thing (both are rows in the wp_posts table, distinguished by post_type).
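
For what it's worth, steps 1 and 2 of Baki's list could be sketched
roughly as below. This is only an illustration under assumptions: it uses
PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, so
that it runs without extra libraries, and the helper name
html_to_post_args(), the "author" and "content" class names and the
'legacy_author' key are all hypothetical. The real selectors depend on
the old site's markup.

```php
<?php
// Hypothetical helper: turn one legacy HTML page into the argument array
// that wp_insert_post() accepts. The class names "author" and "content"
// are assumptions about the old markup, not taken from the real site.
function html_to_post_args($html)
{
    $doc = new DOMDocument();
    // Collect, rather than print, warnings caused by sloppy old markup.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // The page <title> becomes the post title.
    $title = trim($xpath->evaluate('string(//title)'));

    // Assumed: the author name sits in an element with class="author".
    $author = trim($xpath->evaluate('string(//*[@class="author"])'));

    // Assumed: the main text sits in <div class="content">.
    $content = '';
    $nodes = $xpath->query('//div[@class="content"]');
    if ($nodes->length > 0) {
        // Serialise the children so the wrapper <div> itself is dropped.
        foreach ($nodes->item(0)->childNodes as $child) {
            $content .= $doc->saveHTML($child);
        }
    }

    return array(
        'post_title'    => $title,
        'post_content'  => trim($content),
        'post_status'   => 'publish',
        // 'post' or 'page': both are stored in the same posts table.
        'post_type'     => 'post',
        // Not a wp_insert_post() field. Unset it before inserting and map
        // it to a user ID, or store it with add_post_meta() afterwards.
        'legacy_author' => $author,
    );
}

// Tiny made-up page standing in for one of the old HTML files.
$sample = '<html><head><title>Old article</title></head><body>'
        . '<p class="author">Jane Doe</p>'
        . '<div class="content"><p>Hello <b>world</b>.</p></div>'
        . '</body></html>';

$args = html_to_post_args($sample);
echo $args['post_title'], "\n";
```

In a script that has loaded WordPress, the array (minus 'legacy_author')
would then be handed to wp_insert_post(). Run from the command line in a
loop over the files, that is more or less the whole tool.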

I did something similar not long ago, using Ruby to do it, for fun.

Alex

On 6 June 2011 11:23, Baki Goxhaj <banago at gmail.com> wrote:
> Well, if you don't know some PHP I don't know how you are going to do it,
> but here is my advice.
>
> 1. Use http://simplehtmldom.sourceforge.net/ to pull content from the
> old site and map it accordingly to WordPress fields
> 2. Use a custom script to insert posts - like the one you quoted above that
> makes use of the wp_insert_post() function
> 3. Import content as posts rather than pages, as that many pages don't scale
> and will kill your site
>
> Good luck.
>
> Kindly,
>
> Baki Goxhaj
> www.wplancer.com | proverbhunter.com | www.banago.info
>
>
> On Mon, Jun 6, 2011 at 11:15 AM, John Black <immanence7 at gmail.com> wrote:
>
>> Hi there,
>>
>> I have a gargantuan HTML based site I want to port to Wordpress. I'm
>> talking 52,000 individual HTML pages, with a further 10,000 pages with
>> minimal content (mostly pictures with captions) that are basically child
>> pages of these main pages.
>>
>> In addition to the 60,000 HTML pages in total, the site has around 52,000
>> images embedded across these pages.
>>
>> Happily, there is some regularity across these pages, and it appears (from
>> a quick look) that some elements (e.g., names of the authors, and extracts)
>> could be ripped out along with the main content. Not all the archive is
>> regularised, however. Site content goes back to 1998.
>>
>> First, I know I'm going to need a solid strategy here. I kind of need help
>> with that, for although I face this massive task, I'm actually a graphic
>> designer, not a programmer.
>>
>> I know I will need to design a Wordpress page or post structure to
>> ultimately contain the various parts of these pages I want to rip.
>>
>> But most of all, I need to know how to start.
>>
>> 1. I looked at the discussion had here on wp-hackers in February on
>> "Porting static content". It gives a few leads, but I have questions.
>>
>> 2. I looked at the plugin Import HTML Pages. I need to look closer, but a
>> first run attempt on a limited number of files failed. I'm working on a
>> localhost MAMP set up.
>>
>> 3. The consensus from the earlier February discussion was that PHP Simple
>> HTML DOM Parser (http://sourceforge.net/projects/simplehtmldom/) was a
>> great tool. I looked at it, but have no idea how to even start using it. I
>> don't find documentation that speaks to someone as code illiterate as me.
>> Can it be used on a localhost MAMP installation?
>>
>> 4. Someone else mentioned PhpQuery (http://code.google.com/p/phpquery/). I
>> haven't the faintest idea how to use that either.
>>
>> Basically, I need some kind of heads up here.
>>
>> Can someone give a kind of overview of what is possible? Does anyone have
>> any scripts they could share with me that might streamline this and save me
>> from an impossible learning curve?
>>
>> I include below some excerpts of the discussion in February. If anyone can
>> elaborate further on some of that, I'd love to read more!
>>
>>
>> Thanks so much for reading, and sorry for not understanding code. My head
>> works in another way.
>>
>> best,
>> John Black
>>
>> ______________________
>>
>>
>> Bill Dennen dennen at gmail.com
>> Wed Feb 23 01:29:35 UTC 2011
>>
>> >> I've used http://sourceforge.net/projects/simplehtmldom/ a number of
>> times.
>>
>> I've had good luck with that, and phpQuery.
>>
>> http://code.google.com/p/phpquery/
>> phpQuery is a server-side, chainable, CSS3 selector driven Document
>> Object Model (DOM) API based on jQuery JavaScript Library
>>
>> In our case, we first build a sitemap of the pages we wanted to
>> import. It's basically a hierarchy built as a multi-level, unordered
>> list. This allows us to maintain parent-child relationships between
>> pages when they are imported into WordPress. We loop over that
>> unordered list of links, scrape each page, and use phpQuery to select
>> different parts of the page based on jQuery selectors. We can also add
>> custom fields to these imported pages using add_post_meta().
>>
>> -Bill
>>
>> ______________________
>>
>>
>> Christopher Ross cross at thisismyurl.com
>> Tue Feb 22 22:55:22 UTC 2011
>>
>> Scot, this sounds like a perfect use of the wp_insert_post() function
>>
>>
>>        // wp_insert_post() expects slashed input and escapes for the database
>>        // itself. WordPress already adds slashes to $_POST, so pass the values
>>        // through as-is; mysql_real_escape_string() would double-escape them.
>>        $post = array(
>>          'comment_status' => 'closed',
>>          'ping_status' => 'closed',
>>          'post_author' => $authorID,
>>          'post_category' => $cat, // array of category IDs
>>          'post_content' => $_POST['post_content'],
>>          'post_excerpt' => $_POST['post_excerpt'],
>>          'post_status' => $blogpoststatus,
>>          'post_title' => $posttitle,
>>          'post_type' => 'post',
>>          'tags_input' => $_POST['post_tags']
>>        );
>>        $wpid = wp_insert_post($post); // new post ID, or 0 on failure
>>
>>
>> I did a government site a while back with similar restrictions. After
>> downloading the content to a directory using an offline viewer, I ran
>> RegEx on the content until I had 50,000 usable documents. After that, I
>> ran a PHP script to pull in 100 pages at a time, post them to the WP
>> database and move those files.
>>
>> ______________________
>>
>>
>> Keith P. Graham kpgraham at gmail.com
>> Wed Feb 23 15:00:31 UTC 2011
>>
>> I wrote a simple plugin called Instant-Content-plugin
>> http://wordpress.org/extend/plugins/instant-content-plugin to import
>> static files into posts. It uses text files, but you can play with the
>> code to get that the way you want, commenting out the code that
>> replaces crlf with br tags. I made it so it can import data from a zip
>> file of static pages, using the first line of each file as a title.
>>
>> I've used this to import large amounts of static data into WP. Most
>> recently I have been using it to import pages from my older sites into
>> WP. I had several sites with hundreds of pages of content that just
>> grew over the years. Each page had a slightly different layout. I
>> loaded all the pages into Notepad++ and made global changes until I
>> got just bare content. I then zipped up the files and loaded them into
>> WP pages using the instant-content plugin.
>>
>> Keith
>>
>> ______________________
>>
>>
>>
>> Paul paul at codehooligans.com
>> Tue Feb 22 22:49:58 UTC 2011
>>
>> Scott.
>>
>> I've used http://sourceforge.net/projects/simplehtmldom/ a number of
>> times.
>>
>> I'll send you a working script I use to suck down a site.
>>
>> P-
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> wp-hackers mailing list
>> wp-hackers at lists.automattic.com
>> http://lists.automattic.com/mailman/listinfo/wp-hackers
>>

