[wp-hackers] HTML Purifier

Edward Z. Yang edwardzyang at thewritingpot.com
Mon Feb 12 21:35:58 GMT 2007


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi, this is the author of HTML Purifier. I'm glad to hear you discussing
HTML Purifier as a possible alternative.

> The primary downside I see to this is the size/number of files. KSES
> is small and effective as a security filter, while HTML Purifier is
> bigger and can do a whole lot more.

Yes, HTML Purifier is guilty of having a large number of files, but kses
is by no means "effective": see
<http://hp.jpsband.org/comparison.html#kses> for the problems I found in
it.

Would offering a "single huge monster file" help out in any way?

> One possible benefit, something we accomplished on wordpress.com with
> a customized csstidy library, would be the ability to sanitize inline
> CSS.

HTML Purifier actually already rolls its own CSS parser and validator,
so a separate library such as csstidy isn't necessary for sanitizing
inline CSS.

> Does anyone have experience with integrating or profiling HTML Purifier?
> Per?

I actually hacked up a little plugin for WordPress. I'm not sure how
well it works (I don't use WordPress for a blog, although I have it
installed on my own machine), but it seems to be functional. Comments on
it would be appreciated:

<?php
/*
Plugin Name: HTML Purifier
Version: 1.0.0beta
Plugin URI: http://hp.jpsband.org/
Description: Sends blog posts through a standards-compliant HTML filter, HTMLPurifier.  Standards-compliant output, guaranteed!
Author: Edward Z. Yang
*/

// include the library file
set_include_path(
    // change this to the path to your installation of HTML Purifier
    '/Documents and Settings/Edward/My Documents/My Webs/htmlpurifier/library'
    . PATH_SEPARATOR . get_include_path()
);
require_once 'HTMLPurifier.php';

function wordpress_htmlpurifier($text) {
    static $purifier = null;
    if ($purifier === null) $purifier = new HTMLPurifier();

    // ugly hack, since content_save_pre doesn't have strip-slashed content
    static $magic_quotes = null;
    if ($magic_quotes === null) $magic_quotes = get_magic_quotes_gpc();

    if ($magic_quotes) $text = stripslashes($text);

    // preserve magic comments
    $magic_comments = array('more', 'nextpage', 'noteaser');
    foreach ($magic_comments as $name) {
        $text = str_replace(
            "<!--$name-->", "<br class=\"wp-$name\" />", $text
        );
    }

    // do our stuff
    $text = $purifier->purify($text);

    foreach ($magic_comments as $name) {
        $text = str_replace(
            "<br class=\"wp-$name\" />", "<!--$name-->", $text
        );
    }
    if ($magic_quotes) $text = addslashes($text);

    // the original text is lost, I don't like that very much.
    // PreFormatted <http://vapourtrails.ca/wp-preformatted> might
    // be able to help you
    return $text;
}

add_filter('content_save_pre', 'wordpress_htmlpurifier', 100);
// if you're outputting data from post_content_filtered instead,
// you might want to use this
// add_filter('content_filtered_save_pre', 'wordpress_htmlpurifier', 100);

// disable client-side filtering. As a general rule, client-side
// filtering shouldn't be trusted, so we won't make the attempt at all.
function wordpress_mce_allow_all() { return '*[*]'; }
add_filter('mce_valid_elements', 'wordpress_mce_allow_all');

// disable balanceTags, this is core HTML Purifier functionality
remove_filter('content_save_pre', 'balanceTags');

// disable auto-paragraphing, this can easily break advanced HTML
remove_filter('the_content', 'wpautop');
// as long as this filter runs before HTML Purifier, you could use it:
// add_filter('content_filtered_save_pre', 'wpautop');

// disable kses filtering: HTML Purifier is a kses replacement!
remove_filter('content_save_pre', 'wp_filter_post_kses');
remove_filter('content_filtered_save_pre', 'wp_filter_post_kses');

// We decided to keep some filters since they do things that
// HTML Purifier does and are fairly safe. But you may still want
// to replace them. Here they are:

// wptexturize:
//  Applies typographic corrections and stray ampersand corrections.
//  Much of it is redundant, though the dash conversions may be
//  appreciated.
// remove_filter('the_content', 'wptexturize');

?>
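
The magic-comment handling above can be tried on its own, without
WordPress or HTML Purifier installed. The following is only a minimal
sketch: the three function names and strip_html_comments() are
hypothetical helpers invented for illustration, and the stand-in filter
merely strips HTML comments in place of $purifier->purify(). The round
trip only works if the filter emits the placeholder <br /> byte-for-byte
as it received it, which is an assumption the plugin makes as well.

```php
<?php
// Swap WordPress magic comments for placeholder <br /> tags that an
// HTML filter will let through. (Hypothetical helper, not plugin API.)
function protect_magic_comments($text, array $names) {
    foreach ($names as $name) {
        $text = str_replace(
            "<!--$name-->", "<br class=\"wp-$name\" />", $text
        );
    }
    return $text;
}

// Reverse the swap once filtering is done.
function restore_magic_comments($text, array $names) {
    foreach ($names as $name) {
        $text = str_replace(
            "<br class=\"wp-$name\" />", "<!--$name-->", $text
        );
    }
    return $text;
}

// Run any filter callback between the two swaps.
function filter_preserving_magic_comments($text, $filter, array $names) {
    $text = protect_magic_comments($text, $names);
    $text = call_user_func($filter, $text);
    return restore_magic_comments($text, $names);
}

// Stand-in for $purifier->purify(): removes every HTML comment.
// This is NOT what HTML Purifier does in full, just enough to show
// why the placeholders are needed.
function strip_html_comments($text) {
    return preg_replace('/<!--.*?-->/s', '', $text);
}

$names = array('more', 'nextpage', 'noteaser');
$input = 'Intro<!--more-->Rest<!-- stray note -->';

// The stray comment is removed, the <!--more--> marker survives.
echo filter_preserving_magic_comments(
    $input, 'strip_html_comments', $names
), "\n";
```

In the plugin itself the callback would be array($purifier, 'purify')
rather than the stand-in.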

- --
 Edward Z. Yang      Personal: edwardzyang at thewritingpot.com
 SN:Ambush Commander Website: http://www.thewritingpot.com/
 GPGKey:0x869C48DA   http://www.thewritingpot.com/gpgpubkey.asc
 3FA8 E9A9 7385 B691 A6FC B3CB A933 BE7D 869C 48DA
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFF0N29qTO+fYacSNoRAswnAJ48GSuGJ4fW0ZP8enAWNgTl/Dn3lwCeJqe6
lADdwXJSIevw5iCqzCkp49o=
=wPja
-----END PGP SIGNATURE-----