[wp-hackers] Indexing documents for search

John Blackbourn johnbillion+wp at gmail.com
Sun Oct 30 22:50:38 UTC 2011


On 30 October 2011 21:45, Eric Mann <eric at eam.me> wrote:
> Does anyone have any experience extending WP's search functionality to
> include the content of uploaded, non-DB housed documents?

Last year I wrote a plugin for a client that indexes the contents of
PDFs uploaded to WordPress. It uses one of the many PDF-to-text PHP
classes available [1]. We had mixed results with its reliability and
accuracy. For example, it's not always possible to extract text from
PDFs created in certain applications, and others yield text that is
run together, with no spaces where you would expect them. Other PHP
classes for text extraction would probably give different results.

Integrating the PDF text extraction with WordPress was simply a case
of hooking into the 'update_attached_file' filter and creating a post
containing the PDF's text, which is then searchable as usual in
WordPress. If you're interested in the plugin, feel free to email me
off-list.
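
A minimal sketch of that integration follows. It is not the plugin's
actual code. The pdfidx_ function names and the '_pdfidx_source' meta
key are made up for illustration, and pdfidx_extract_text() is a stub
standing in for whichever PDF-to-text class you pick. It also assumes
the filter receives a readable filesystem path.

```php
<?php
// Hypothetical stub: replace the body with a call to your chosen
// PDF-to-text class. Must return the extracted text, or '' on failure.
function pdfidx_extract_text( $path ) {
    return '';
}

// Build the wp_insert_post() arguments for the searchable post.
// Pure function, so it can be exercised outside WordPress.
function pdfidx_post_args( $text, $title ) {
    // Collapse runs of whitespace and line breaks into single spaces.
    $text = trim( preg_replace( '/\s+/', ' ', $text ) );
    return array(
        'post_title'   => $title,
        'post_content' => $text,
        'post_status'  => 'publish',
        'post_type'    => 'post',
    );
}

// Callback for the 'update_attached_file' filter.
function pdfidx_on_update_attached_file( $file, $attachment_id ) {
    // The filter fires for every attachment type, so only handle PDFs.
    $is_pdf = 'pdf' === strtolower( pathinfo( $file, PATHINFO_EXTENSION ) );
    if ( $is_pdf && is_readable( $file ) ) {
        $text = pdfidx_extract_text( $file );
        if ( '' !== trim( $text ) ) {
            $args    = pdfidx_post_args( $text, get_the_title( $attachment_id ) );
            $post_id = wp_insert_post( $args );
            if ( $post_id && ! is_wp_error( $post_id ) ) {
                // Remember which attachment this post was generated from.
                add_post_meta( $post_id, '_pdfidx_source', $attachment_id, true );
            }
        }
    }
    return $file; // a filter callback must hand the value back
}

// Register only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'update_attached_file', 'pdfidx_on_update_attached_file', 10, 2 );
}
```

As written, this creates a new post every time the attached file is
updated. A real plugin would probably look up an existing post by the
'_pdfidx_source' meta value and update that one instead.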

[1] http://www.google.com/search?q=pdf+to+text+php

