[wp-hackers] Overriding get_posts() behaviour

Christian Aust christian at wilde-welt.de
Sun Jul 3 14:08:22 GMT 2005


Hi Bill,

On 03.07.2005 at 14:48, ml_wordpress at copperleaf.org wrote:

> I've also taken a look at Denis' plugin and have a few ideas that 
> maybe you guys could add. I've modified Denis' plugin on my testbed 
> just for fun and could send you the code if you wish. Here is a list 
> of ideas:
>
> 1) Probably the most useful function for me was that I added a filter 
> to the sem_search_index function that allows additional plugins to add 
> additional words to the node_content column. This could actually be a 
> foundation filter for the plugin in that all search words could be 
> added that way: the_content, the_title, post_tags, and, in my case, 
> data from columns in new tables.

Lucene already handles this: a 'document' is a generic description of 
the data stored in the index, and consists of a number of fields. Each 
field can be indexed (searchable), stored as full text, or both, so it 
can serve as a searchable or an auxiliary field.
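
To make the document/field idea concrete, here is a minimal sketch in 
Python. It is not Lucene's API; the names Field, Document and 
build_post_document are hypothetical and only model the idea that each 
field is independently marked as indexed and/or stored, and that 
plugins could contribute extra fields.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # participates in searches
    stored: bool = False   # kept verbatim so it can be shown in results


@dataclass
class Document:
    fields: list = field(default_factory=list)

    def add(self, name, value, indexed=True, stored=False):
        self.fields.append(Field(name, value, indexed, stored))
        return self

    def searchable_text(self):
        # Everything a query is allowed to match against.
        return " ".join(f.value for f in self.fields if f.indexed)

    def stored_values(self):
        # Everything that can be read back from a search hit.
        return {f.name: f.value for f in self.fields if f.stored}


def build_post_document(post_id, title, content, extra_fields=()):
    """Hypothetical helper: turn a blog post into a Document.

    extra_fields is a sequence of (name, value) pairs, e.g. tags or
    data that a plugin pulled from its own tables.
    """
    doc = Document()
    doc.add("id", str(post_id), indexed=False, stored=True)
    doc.add("title", title, indexed=True, stored=True)
    doc.add("content", content, indexed=True, stored=False)
    for name, value in extra_fields:
        doc.add(name, value)
    return doc
```

With this shape, additional search words from other plugins are just 
more fields on the same document.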

> 2) I found that by using a fulltext search, if you search on 'bean', 
> it won't match 'beans'. I don't know if there is something in the 
> fulltext search that can allow you to do 'like' queries.

That's exactly what goes by the name of "stemming": A stemmer 
transforms the non-stopword words to their 'base' form, e.g. the 
singular. "Houses" will become "house", "women" will be stored as 
"woman". The text finally stored might look funny, but it is fully 
searchable, provided that the query is transformed the same way.

Obviously, stemmers are language-dependent. An English stemmer won't 
produce very useful results when used with a German text. Lucene 
provides a number of stemmers.
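
As a rough illustration of why the same transformation has to be 
applied to both the indexed text and the query, here is a deliberately 
naive English suffix stripper in Python. It is not one of Lucene's 
stemmers; naive_stem, build_index and search are hypothetical names, 
and irregular plurals such as "women" are not handled by this sketch.

```python
def naive_stem(word):
    """Very rough English plural stripping, for illustration only."""
    w = word.lower()
    if len(w) <= 3:
        return w                 # too short to strip safely
    if w.endswith("sses"):
        return w[:-2]            # glasses -> glass
    if w.endswith("ies"):
        return w[:-3] + "y"      # stories -> story
    if w.endswith("ss"):
        return w                 # glass stays glass
    if w.endswith("s"):
        return w[:-1]            # beans -> bean, houses -> house
    return w


def build_index(docs):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in text.split():
            index.setdefault(naive_stem(token), set()).add(doc_id)
    return index


def search(index, query):
    # The query goes through the same stemmer as the indexed text.
    return index.get(naive_stem(query), set())
```

Because both sides are stemmed, a query for "bean" and a query for 
"beans" end up looking for the same index key.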

> 3) I added some code that would clean out all funky characters, remove 
> all duplicates and collapse all whitespace in the node_content column. 
> This can shorten the size of the field significantly and removing the 
> dups is nice if you aren't doing weighted searches. Something else to 
> consider would be to remove all stopwords. (Configurable from an admin 
> page?)

Every language has a number of well-known stopwords. Filtering these 
out is usually done together with stemming.
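
The cleanup described in point 3 combined with stopword removal could 
look roughly like the following Python sketch. The function name 
clean_content and the tiny STOPWORDS set are assumptions made for 
illustration; a real list would be per language and, as suggested, 
configurable.

```python
import re

# Tiny illustrative English list; a real one would be much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "in", "is", "to"})


def clean_content(text, stopwords=STOPWORDS):
    """Lowercase, drop non-word characters, stopwords and duplicates.

    The result is a single space-separated string, so all runs of
    whitespace are collapsed as a side effect.
    """
    # Runs of letters/digits (Unicode-aware, underscore excluded).
    words = re.findall(r"[^\W_]+", text.lower())
    seen = set()
    kept = []
    for word in words:
        if word in stopwords or word in seen:
            continue
        seen.add(word)
        kept.append(word)   # first occurrence wins, order is preserved
    return " ".join(kept)
```

Dropping duplicates like this only makes sense when searches are not 
weighted by term frequency, as the quoted point already notes.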

> 4) One last idea is that perhaps an option could be to store the 
> soundex (or some other algorithm) for the word list so that searches 
> are done on that instead of the actual word.

see: Stemming.
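
For comparison, here is a sketch of the classic American Soundex code 
in Python (the function name soundex is just a label for this example). 
It groups words that sound alike, which is a different goal from 
stemming: under these rules "bean" and "beans" still get different 
codes, so Soundex alone would not solve the plural problem from point 2.

```python
def soundex(word):
    """Classic 4-character American Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue             # h and w do not separate equal codes
        code = codes.get(ch, "")  # vowels and unknown letters map to ""
        if code and code != prev:
            result += code
        prev = code              # a vowel resets prev, allowing a repeat
    return (result + "000")[:4]  # pad or truncate to four characters
```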

> Anyway, like I said above, I'd be glad to send you guys my mods or I'd 
> be glad to help with parts of this if you wish.

My main problem would be: How can I start a process from PHP that forks 
and stays resident? How could those processes communicate efficiently? 
The search itself is actually quite trivial, since Lucene is so well 
thought out.

Regards,

-   Christian

--

Christian Aust
http://publicvoidblog.de/  -  mailto:christian at wilde-welt.de
icq: 84500990 - Yahoo!: datenimperator - MSN: datenimperator


