[wp-hackers] Overriding get_posts() behaviour
Christian Aust
christian at wilde-welt.de
Sun Jul 3 14:08:22 GMT 2005
Hi Bill,
On 03.07.2005 at 14:48, ml_wordpress at copperleaf.org wrote:
> I've also taken a look at Denis' plugin and have a few ideas that
> maybe you guys could add. I've modified Denis' plugin on my testbed
> just for fun and could send you the code if you wish. Here is a list
> of ideas:
>
> 1) Probably the most useful function for me was that I added a filter
> to the sem_search_index function that allows additional plugins to add
> additional words to the node_content column. This could actually be a
> foundation filter for the plugin in that all search words could be
> added that way: the_content, the_title, post_tags, and, in my case,
> data from columns in new tables.
Lucene already does this: a 'document' is a generic description of the
data stored in the index, and consists of a number of fields. Each field
can be indexed, stored as full text, or both, so a field can be
searchable or merely auxiliary.
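
To make the document/field idea concrete, here is a rough sketch in
Python. It is not Lucene's API; Field, build_document and the field
names are hypothetical, chosen only to show how other plugins could
contribute extra fields to the same document.

```python
from dataclasses import dataclass

# Hypothetical model of a document as a list of fields (not Lucene's API).
@dataclass
class Field:
    name: str
    value: str
    indexed: bool = True   # searchable through the index
    stored: bool = False   # full text kept so it can be returned with a hit

def build_document(post, extra_fields=()):
    """Build the field list for one post; plugins pass in extra_fields."""
    fields = [
        Field("title", post["title"], indexed=True, stored=True),
        Field("content", post["content"], indexed=True, stored=False),
    ]
    fields.extend(extra_fields)
    return fields
```

A plugin with data in its own tables would simply hand over something
like Field("tags", "wordpress search") when the post is indexed.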
> 2) I found that by using a fulltext search, if you search on 'bean',
> it won't match 'beans'. I don't know if there is something in the
> fulltext search that can allow you to do 'like' queries.
That's exactly what goes by the name of "stemming": a stemmer
transforms each non-stopword word to its 'base' form, e.g. the
singular. "Houses" will become "house", "women" will be stored as
"woman". The text finally stored might look funny - but it is fully
searchable, provided that the query is transformed the same way.
Obviously, stemmers are language dependent. An English stemmer won't
produce very smart results when used with a German text. Lucene provides
a number of stemmers.
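
As an illustration only, a deliberately naive English stemmer might look
like the Python sketch below. It is not one of Lucene's stemmers (real
ones use far more careful rules); the IRREGULAR table and the suffix
rules are assumptions made up for this example.

```python
# Naive, hypothetical English stemmer; real stemmers are far more thorough.
IRREGULAR = {"women": "woman", "men": "man", "children": "child"}

def stem(word):
    """Reduce a word to a rough singular 'base' form."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # "berries" becomes "berry"
    if w.endswith("ss"):
        return w                     # "glass" is left alone
    if w.endswith("s") and len(w) > 3:
        return w[:-1]                # "beans" becomes "bean"
    return w
```

The important part is that both the indexed text and the query go
through the same function, so a search for 'bean' and a post containing
'beans' end up with the same term.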
> 3) I added some code that would clean out all funky characters, remove
> all duplicates and collapse all whitespace in the node_content column.
> This can shorten the size of the field significantly and removing the
> dups is nice if you aren't doing weighted searches. Something else to
> consider would be to remove all stopwords. (Configurable from an admin
> page?)
Every language has a number of well-known stopwords. Filtering these
out is usually done together with stemming.
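
A minimal sketch of such a filter in Python, assuming a tiny made-up
English stopword list (STOPWORDS and analyze are hypothetical names, not
part of Lucene or of Denis' plugin). It also covers the lowercasing and
the removal of funky characters and extra whitespace from point 3:

```python
import re

# Tiny illustrative stopword list; a real one is longer and per language.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

def analyze(text):
    """Lowercase, split on anything that is not a-z or 0-9, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

The same function has to be applied to the search query, otherwise a
query containing a stopword would look for a term that was never
indexed.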
> 4) One last idea is that perhaps an option could be to store the
> soundex (or some other algorithm) for the word list so that searches
> are done on that instead of the actual word.
See stemming, above.
> Anyway, like I said above, I'd be glad to send you guys my mods or I'd
> be glad to help with parts of this if you wish.
My main problem would be: how can I start a process from PHP that forks
and stays resident? How could those processes communicate efficiently?
The search itself is actually quite trivial, since Lucene is so well
thought out.

Regards,
- Christian
--
Christian Aust
http://publicvoidblog.de/ - mailto:christian at wilde-welt.de
icq: 84500990 - Yahoo!: datenimperator - MSN: datenimperator