[wp-hackers] Measuring "distance" between two web pages

Phillip Lord phillip.lord at newcastle.ac.uk
Tue Sep 18 12:14:39 UTC 2012



Add up the numerical values of all the characters on each page and
compare the two totals. It's as likely to work as anything. This will,
of course, fail badly under some circumstances (a single character in
the CSS can change things *a lot*), but will work under others.
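
A minimal sketch of the character-sum idea in Python, in case it helps.
The function names and the normalisation by the larger total are my own
assumptions, not anything standard:

```python
def char_sum(html: str) -> int:
    """Sum the Unicode code points of every character in the page."""
    return sum(ord(ch) for ch in html)


def char_sum_distance(before: str, after: str) -> float:
    """Relative difference between the two sums, from 0.0 upwards.

    Dividing by the larger sum is an arbitrary choice made so that
    pages of different sizes give comparable scores.
    """
    a, b = char_sum(before), char_sum(after)
    return abs(a - b) / max(a, b, 1)


# Weakness: reordering characters does not change the sum, so two
# quite different pages can score as identical.
print(char_sum_distance("<p>ab</p>", "<p>ba</p>"))
```

Because the sum ignores ordering, a score of zero does not prove the
pages match; it is only a cheap first check.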

Another way would be to render the HTML to an image, then compare the
images. There are lots of off-the-shelf tools to do this (for instance,
"compare" from ImageMagick). This would check how the page actually
looks rather than its markup. Setting a sane threshold might be hard,
but who knows till you try it.
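
A rough sketch of the image route, again in Python. It assumes the two
screenshots already exist as PNG files of the same size (how you render
them is up to you), that ImageMagick's "compare" is on the PATH (on
ImageMagick 7 the command may be "magick compare"), and that the RMSE
metric is printed on stderr in the form "1234.5 (0.0188)". The function
names and the 0.1 threshold are placeholders:

```python
import re
import subprocess


def parse_rmse(stderr_text: str) -> float:
    """Pull the normalised value out of output like '1234.5 (0.0188)'."""
    match = re.search(r"\(([-+0-9.eE]+)\)", stderr_text)
    if match is None:
        raise ValueError(f"unexpected compare output: {stderr_text!r}")
    return float(match.group(1))


def image_distance(before_png: str, after_png: str) -> float:
    """Run 'compare -metric RMSE' and return the normalised error."""
    result = subprocess.run(
        ["compare", "-metric", "RMSE", before_png, after_png, "null:"],
        capture_output=True,
        text=True,
    )
    # compare exits with 0 (similar) or 1 (dissimilar); anything else
    # is treated here as a real failure.
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr.strip())
    return parse_rmse(result.stderr)


def probably_broken(before_png: str, after_png: str,
                    threshold: float = 0.1) -> bool:
    """Flag the update if the images differ by more than the threshold."""
    return image_distance(before_png, after_png) > threshold
```

The threshold would have to be tuned against real before/after pairs;
0.1 is a guess, not a recommendation.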

Otherwise, I think you are asking for fully automated robot checking,
which is going to get hard.

Phil


David Anderson <david at wordshell.net> writes:
> This is a more general question than specifically WordPress, but perhaps
> someone will have an idea.
>
> With WordShell (wordshell.net) we've got customers who manage enormous numbers
> of sites (mass hosting). They can't be bothered or can't afford to test each
> one individually after doing an update from 3.4.1 to 3.4.2.
>
> WordShell presently visits the site home-page before and after alterations, to
> test for HTTP errors. But that's rather basic.
>
> What I'd be interested in is if anyone knows of any tool or method for a
> generalised way of producing a score for a difference between two HTML pages.
> i.e. I download the HTML before, and again after - are they "probably" the
> same page? They can't be expected to be exactly the same; for one thing, the
> "Generator" meta tag will have changed. The page might show the present time,
> or have hidden timings in the HTML source, etc.
>
> So, what I'm interested in is a way to produce a "distance" score between the
> two pages. Something so that we can say something like "if the difference is
> more than X, then there is a Y% chance that these are not the same page". By
> choosing for Y, I can then test on X. Then WordShell will put up a flag based
> on that statistical likelihood - "OY! There's a good chance that that update
> broke the page!".
>
> This is quite interesting not just for maintaining WordPress but for any
> website, but I haven't come across anything like it - any ideas?
>
> Many thanks,
> David

-- 
Phillip Lord,                           Phone: +44 (0) 191 222 7827
Lecturer in Bioinformatics,             Email: phillip.lord at newcastle.ac.uk
School of Computing Science,            http://homepages.cs.ncl.ac.uk/phillip.lord
Room 914 Claremont Tower,               skype: russet_apples
Newcastle University,                   msn: msn at russet.org.uk
NE1 7RU                                 twitter: phillord

