[wp-hackers] Help with the API on WordPress.org?
wordpress at dd32.id.au
Fri Jan 2 01:17:05 GMT 2009
2009/1/2 Mike Schinkel <mikeschinkel at gmail.com>
> > I quickly stopped, for one reason: Google doesn't like screen scrapers. They
> block you after 2-3 similar repetitive searches; do it enough times
> (like the plugin search) and you'll get blocked very quickly. Google's good
> at detecting this.
> Good point, but here are my thoughts on that. If you were doing it for
> WordPress.org I'd agree, but if doing it in a plugin where the user
> will probably only do a few searches at any one time, I think they shouldn't
> care so much and would have a very hard time detecting it. If it
> becomes a problem I could simulate the user's browser using their user
> agent, etc. Since this would be such a low-volume thing on a per-person
> basis, I can't see it causing that much trouble?
What I found was that volume wasn't the issue. It was relatively closely
related searches within a certain timeframe. For example,
"site:wordpress.org/extend/plugins" is going to be constant; it's just the
search term that changes. Or flicking to the next page faster than a normal
human would (as you've got to filter out many duplicates from a Google
search). Imitating a user's browser does nothing. They'll block a human if a
human does it too fast, too. Similar search queries coming from a single
host can also trip the block. I've been blocked numerous times when
searching for things while changing the search terms slightly and
flicking through a few pages.
> Also, I'm doing it for my own use but planning to make it available for
> others who want to use it.
> > So, let's use one of the available Google APIs! ...Then you realise the
> only ones available only return the first 10 results with no further
> > Actually, today you can get 4 or 8 results and then you can get subsequent
> > pages. BUT, the showstopper for me was their lack of "exclude"
> > functionality. Sure, I could do it with PHP, but then I'd have to add in a
> > JSON library, and I decided it would just be easier to parse the HTML.
Ah, JSON. And those 4 or 8 pages include duplicates, things like /plugin/
and /plugin/installation/. That was the showstopper for me with JSON.
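
A minimal sketch of that kind of de-duplication, in Python purely for
illustration. The URL prefix is assumed from the "site:" query earlier in
this thread, and the function names and sample URLs are made up for the
example; this is not what the WordPress.org code actually does.

```python
from urllib.parse import urlparse

# Assumed layout: wordpress.org/extend/plugins/<slug>/[subpage/]
PLUGIN_PREFIX = "/extend/plugins/"


def plugin_slug(url):
    """Return the plugin slug for a plugin-directory URL, or None."""
    path = urlparse(url).path
    if not path.startswith(PLUGIN_PREFIX):
        return None
    # Everything up to the next "/" after the prefix is the slug.
    slug = path[len(PLUGIN_PREFIX):].split("/", 1)[0]
    return slug or None


def unique_slugs(urls, exclude=()):
    """Collapse result URLs to one slug each, keeping first-seen order.

    Slugs listed in `exclude` are dropped, which stands in for the
    "exclude" feature the search API lacks.
    """
    seen = set(exclude)
    slugs = []
    for url in urls:
        slug = plugin_slug(url)
        if slug is None or slug in seen:
            continue
        seen.add(slug)
        slugs.append(slug)
    return slugs


# Sample input only: a main page, one of its subpages, and a non-plugin URL.
sample_urls = [
    "http://wordpress.org/extend/plugins/akismet/",
    "http://wordpress.org/extend/plugins/akismet/installation/",
    "http://wordpress.org/extend/plugins/hello-dolly/faq/",
    "http://wordpress.org/about/",
]
print(unique_slugs(sample_urls))
```

Keying on the slug rather than the full URL is what folds /plugin/ and
/plugin/installation/ into a single result.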
> > I have a big project coming up next week where I am going to be doing a
> > lot of plugin research, so I've got to get this working for myself ASAP.
> > I'll look into Yahoo but I've got Google working for now and the real
> > problem is the WordPress.org's search and/or API.
> > However, since it's not that complex, there isn't a function to send a
> list of slugs and get a list of details back, only a function to return
> those details for a single plugin -- mainly because there was no need for
> that kind of functionality in the requirements at the time.. :)
> Ugh :-( I guess I could scrape WordPress.org and cache locally. Yuck.
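
Short of scraping, one rough way to live with the single-plugin call is to
loop over the slugs and cache each answer locally. A sketch in Python, where
`fetch_one` is a hypothetical stand-in for whatever makes the real per-plugin
request and `details_for_slugs` is a made-up name; nothing here is the actual
WordPress.org API.

```python
def details_for_slugs(slugs, fetch_one, cache=None):
    """Look up details for many slugs using a single-plugin fetcher.

    fetch_one: hypothetical callable taking a slug and returning its details.
    cache: optional dict reused between calls so each slug is fetched once.
    """
    if cache is None:
        cache = {}
    results = {}
    for slug in slugs:
        if slug not in cache:
            # Only hit the remote side for slugs we have not seen before.
            cache[slug] = fetch_one(slug)
        results[slug] = cache[slug]
    return results
```

Passing the same `cache` dict on later calls means repeated research on the
same plugins costs no further requests.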
> > Any chance I could write an alpha version that could be hosted on
> > WordPress.org?
I highly doubt it :) They don't like running just any code. It needs to be
written to take advantage of bbPress correctly, something which I didn't do
really well, so there was another wait of a few weeks while someone rewrote
parts of it. Even now, I'm pretty sure what's on there has been modified by
someone else :)
> > BTW, even though it's not what I needed, thanks for documenting the API!
I said I'd do it, and 2 people asking about it in the same week kicked
me into action :) Need to accomplish something today!