[wp-hackers] pulling a massive HTML site into Wordpress

Baki Goxhaj banago at gmail.com
Mon Jun 6 10:33:33 UTC 2011


>
> posts and pages, as far as the database is concerned, are basically the
> same thing.
>

Right, but as far as the rewrite engine for beautiful permalinks is
concerned, they are totally different things.
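
To make the steps quoted below concrete (scrape, map to WordPress fields,
insert as posts), here is a rough sketch. It is an illustration under
assumptions, not a finished importer: it uses PHP's built-in DOMDocument
rather than Simple HTML DOM, and the selectors (span.author, div#content),
the function names legacy_html_to_post() and import_legacy_file(), and the
'legacy_author' meta key are all made up for the example. The WordPress
calls are only made when WordPress is actually loaded, so the parsing part
can be tried on its own from the command line.

```php
<?php
// Sketch only. Uses PHP's built-in DOM extension instead of Simple HTML DOM.
// Selectors, function names and the meta key below are hypothetical.

// When run as a command-line importer, WordPress would be loaded first, e.g.:
// require '/path/to/wordpress/wp-load.php';   // path is an assumption

/**
 * Parse one legacy HTML document and map it to WordPress fields.
 * Returns array('post' => args for wp_insert_post(), 'author' => author name).
 */
function legacy_html_to_post($html)
{
    // Old hand-written markup is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $title  = trim($xpath->evaluate('string(//title)'));
    $author = trim($xpath->evaluate('string(//span[@class="author"])'));

    // Keep the inner markup of the main content container.
    $content   = '';
    $container = $xpath->query('//div[@id="content"]')->item(0);
    if ($container !== null) {
        foreach ($container->childNodes as $child) {
            $content .= $doc->saveHTML($child);
        }
    }

    return array(
        'post' => array(
            'post_title'   => $title,
            'post_content' => trim($content),
            'post_status'  => 'publish',
            'post_type'    => 'post', // posts rather than pages
        ),
        'author' => $author,
    );
}

/**
 * Import one file. Returns the new post ID, or 0 if nothing was inserted.
 */
function import_legacy_file($path)
{
    $mapped = legacy_html_to_post(file_get_contents($path));
    if (!function_exists('wp_insert_post')) {
        return 0; // WordPress not loaded: parse only, insert nothing
    }
    $post_id = wp_insert_post($mapped['post']);
    if ($post_id && $mapped['author'] !== '') {
        // Store the scraped author name as a custom field.
        add_post_meta($post_id, 'legacy_author', $mapped['author']);
    }
    return (int) $post_id;
}

// Quick dry run on an inline sample page.
$sample = '<html><head><title>Old article</title></head><body>'
        . '<span class="author">Jane Doe</span>'
        . '<div id="content"><p>Hello <b>world</b>.</p></div></body></html>';
$mapped = legacy_html_to_post($sample);
echo $mapped['post']['post_title'], "\n";
```

For a site of this size you would call import_legacy_file() over the
downloaded files in small batches rather than in one request, and check a
sample of the results before running the rest.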

Kindly,

Baki Goxhaj
www.wplancer.com | proverbhunter.com | www.banago.info


On Mon, Jun 6, 2011 at 12:30 PM, Alex Andrews <awgandrews at gmail.com> wrote:

> Yup, sadly there is no way of doing this without hacking some PHP. But
> Baki's instructions are entirely correct - you could do it as a
> command line tool. I'm not sure about the latter instruction - posts
> and pages, as far as the database is concerned, are basically the same
> thing.
>
> I did something similar not long ago, using Ruby to do it, for fun.
>
> Alex
>
> On 6 June 2011 11:23, Baki Goxhaj <banago at gmail.com> wrote:
> > Well, if you don't know some PHP I don't know how you are going to do it,
> > but here is my advice.
> >
> > 1. Use http://simplehtmldom.sourceforge.net/ to pull content from the
> > old site and map it to WordPress fields
> > 2. Use a custom script to insert posts, like the one you quoted above
> > that makes use of the wp_insert_post() function
> > 3. Import the content as posts rather than pages, as that many pages
> > don't scale and will kill your site
> >
> > Good luck.
> >
> > Kindly,
> >
> > Baki Goxhaj
> > www.wplancer.com | proverbhunter.com | www.banago.info
> >
> >
> > On Mon, Jun 6, 2011 at 11:15 AM, John Black <immanence7 at gmail.com> wrote:
> >
> >> Hi there,
> >>
> >> I have a gargantuan HTML-based site I want to port to WordPress. I'm
> >> talking 52,000 individual HTML pages, with a further 10,000 pages with
> >> minimal content (mostly pictures with captions) that are basically child
> >> pages of these main pages.
> >>
> >> In addition to the 60,000 HTML pages in total, the site has around
> >> 52,000 images embedded across these pages.
> >>
> >> Happily, there is some regularity across these pages, and it appears
> >> (from a quick look) that some elements (e.g., names of the authors, and
> >> extracts) could be ripped out along with the main content. Not all the
> >> archive is regularised, however. Site content goes back to 1998.
> >>
> >> First, I know I'm going to need a solid strategy here. I kind of need
> >> help with that, for although I face this massive task, I'm actually a
> >> graphic designer, not a programmer.
> >>
> >> I know I will need to design a WordPress page or post structure to
> >> ultimately contain the various parts of these pages I want to rip.
> >>
> >> But most of all, I need to know how to start.
> >>
> >> 1. I looked at the discussion had here on wp-hackers in February on
> >> "Porting static content". It gives a few leads, but I have questions.
> >>
> >> 2. I looked at the plugin Import HTML Pages. I need to look closer, but
> >> a first-run attempt on a limited number of files failed. I'm working on
> >> a localhost MAMP setup.
> >>
> >> 3. The consensus from the earlier February discussion was that PHP
> >> Simple HTML DOM Parser (http://sourceforge.net/projects/simplehtmldom/)
> >> was a great tool. I looked at it, but have no idea how to even start
> >> using it. I can't find documentation that speaks to someone as code
> >> illiterate as me. Can it be used on a localhost MAMP installation?
> >>
> >> 4. Someone else mentioned phpQuery (http://code.google.com/p/phpquery/).
> >> I haven't the faintest idea how to use that either.
> >>
> >> Basically, I need some kind of heads up here.
> >>
> >> Can someone give a kind of overview of what is possible? Does anyone
> >> have any scripts they could share with me that might streamline this and
> >> save me from an impossible learning curve?
> >>
> >> I include below some excerpts of the discussion in February. If anyone
> >> can elaborate further on some of that, I'd love to read more!
> >>
> >>
> >> Thanks so much for reading, and sorry for not understanding code. My
> >> head works in another way.
> >>
> >> best,
> >> John Black
> >>
> >> ______________________
> >>
> >>
> >> Bill Dennen dennen at gmail.com
> >> Wed Feb 23 01:29:35 UTC 2011
> >>
> >> >> I've used http://sourceforge.net/projects/simplehtmldom/ a number of times.
> >>
> >> I've had good luck with that, and phpQuery.
> >>
> >> http://code.google.com/p/phpquery/
> >> phpQuery is a server-side, chainable, CSS3 selector driven Document
> >> Object Model (DOM) API based on jQuery JavaScript Library
> >>
> >> In our case, we first build a sitemap of the pages we want to
> >> import. It's basically a hierarchy built as a multi-level, unordered
> >> list. This allows us to maintain parent-child relationships between
> >> pages when they are imported into WordPress. We loop over that
> >> unordered list of links, scrape each page, and use phpQuery to select
> >> different parts of the page based on jQuery selectors. We can also add
> >> custom fields to these imported pages using add_post_meta.
> >>
> >> -Bill
> >>
> >> ______________________
> >>
> >>
> >> Christopher Ross cross at thisismyurl.com
> >> Tue Feb 22 22:55:22 UTC 2011
> >>
> >> Scot, this sounds like a perfect use of the wp_insert_post() function
> >>
> >>
> >>        $post = array(
> >>          'comment_status' => 'closed',
> >>          'ping_status' => 'closed',
> >>          'post_author' => $authorID,
> >>          'post_category' => $cat,
> >>          'post_content' => mysql_real_escape_string($_POST['post_content']),
> >>          'post_excerpt' => mysql_real_escape_string($_POST['post_excerpt']),
> >>          'post_status' => $blogpoststatus,
> >>          'post_title' => $posttitle,
> >>          'post_type' => 'post',
> >>          'tags_input' =>  $_POST['post_tags']
> >>        );
> >>        $wpid = wp_insert_post($post);
> >>
> >>
> >> I did a government site a while back with similar restrictions. After
> >> downloading the content to a directory using an offline viewer, I simply
> >> ran RegEx on the content until I had 50,000 usable documents. After that,
> >> I simply ran a PHP script to pull in 100 pages at a time, post to the WP
> >> database and move those files.
> >>
> >> ______________________
> >>
> >>
> >> Keith P. Graham kpgraham at gmail.com
> >> Wed Feb 23 15:00:31 UTC 2011
> >>
> >> I wrote a simple plugin called Instant-Content-plugin
> >> http://wordpress.org/extend/plugins/instant-content-plugin to import
> >> static files into posts. It uses text files, but you can play with the
> >> code to get that the way you want, commenting out the code that
> >> replaces crlf with br tags. I made it so it can import data from a zip
> >> file of static pages, using the first line of each file as a title.
> >>
> >> I've used this to import large amounts of static data into WP. Most
> >> recently I have been using it to import pages from my older sites into
> >> WP. I had several sites with hundreds of pages of content that just
> >> grew over the years. Each page had a slightly different layout. I
> >> loaded all the pages into Notepad++ and made global changes until I
> >> got just bare content. I then zipped up the files and loaded them into
> >> WP pages using the instant-content plugin.
> >>
> >> Keith
> >>
> >> ______________________
> >>
> >>
> >>
> >> Paul paul at codehooligans.com
> >> Tue Feb 22 22:49:58 UTC 2011
> >>
> >> Scott.
> >>
> >> I've used http://sourceforge.net/projects/simplehtmldom/ a number of
> >> times.
> >>
> >> I'll send you a working script I use to suck down a site.
> >>
> >> P-
> >>
> >>
> >>
> >>
> >>
> >>
> >> _______________________________________________
> >> wp-hackers mailing list
> >> wp-hackers at lists.automattic.com
> >> http://lists.automattic.com/mailman/listinfo/wp-hackers
> >>
> >
>

