[wp-hackers] Measuring "distance" between two web pages

Kokarn kokarn at gmail.com
Tue Sep 18 12:18:34 UTC 2012


There are a couple of PHP functions for exactly this, mainly
levenshtein <http://php.net/manual/en/function.levenshtein.php> and
similar_text <http://www.php.net/manual/en/function.similar-text.php>.

I have used them in the past when I was building something along these
lines, and it's absolutely possible.

There are limitations, however, on how long the strings can be
(levenshtein() only accepts strings of up to 255 characters, and
similar_text() gets slow on long input), so they might not be optimal for
whole pages.
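
To make the two measures concrete, here is a minimal sketch in Python rather
than PHP, so treat it as an illustration of the idea and not as what the PHP
functions do internally. levenshtein() below is a plain edit-distance
implementation; similarity() uses difflib.SequenceMatcher from the Python
standard library as a rough stand-in for similar_text (it is a different
algorithm and will not give the same numbers). The function names and the
sample strings are assumptions made up for this example.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    before = "<p>Hello world</p>"
    after = "<p>Hello, world!</p>"
    print(levenshtein(before, after), similarity(before, after))
```

Both run in roughly quadratic time in the length of the input, so on full
pages it may be worth comparing tokens or lines instead of single characters.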

On 18 September 2012 14:14, Phillip Lord <phillip.lord at newcastle.ac.uk> wrote:

>
>
> Add up all the numerical values of all the characters. It's as likely to
> work as anything. This will, of course, fail badly under some
> circumstances (a single character in the CSS can change things *a lot*),
> but will work under others.
>
> Another way would be to render the HTML to an image, then compare the
> images. There are lots of off-the-shelf tools to do this (for instance,
> "compare" from ImageMagick). This would check the visualisation. Setting
> a sane threshold might be hard, but who knows till you try it.
>
> Otherwise, I think you are asking about automated robot checking which
> is going to get hard.
>
> Phil
>
>
> David Anderson <david at wordshell.net> writes:
> > This is a more general question than specifically WordPress, but perhaps
> > someone will have an idea.
> >
> > With WordShell (wordshell.net) we've got customers who manage enormous
> > numbers of sites (mass hosting). They can't be bothered or can't afford
> > to test each one individually after doing an update from 3.4.1 to 3.4.2.
> >
> > WordShell presently visits the site home-page before and after
> > alterations, to test for HTTP errors. But that's rather basic.
> >
> > What I'd be interested in is if anyone knows of any tool or method for a
> > generalised way of producing a score for a difference between two HTML
> > pages. i.e. I download the HTML before, and again after - are they
> > "probably" the same page? They can't be expected to be exactly the same;
> > for one thing, the "Generator" meta tag will have changed. The page might
> > show the present time, or have hidden timings in the HTML source, etc.
> >
> > So, what I'm interested in is a way to produce a "distance" score between
> > the two pages. Something so that we can say something like "if the
> > difference is more than X, then there is a Y% chance that these are not
> > the same page". By choosing a value for Y, I can then test on X. Then
> > WordShell will put up a flag based on that statistical likelihood - "OY!
> > There's a good chance that that update broke the page!".
> >
> > This is quite interesting not just for maintaining WordPress but for any
> > website, but I haven't come across anything like it - any ideas?
> >
> > Many thanks,
> > David
>
> --
> Phillip Lord,                           Phone: +44 (0) 191 222 7827
> Lecturer in Bioinformatics,             Email: phillip.lord at newcastle.ac.uk
> School of Computing Science,
> http://homepages.cs.ncl.ac.uk/phillip.lord
> Room 914 Claremont Tower,               skype: russet_apples
> Newcastle University,                   msn: msn at russet.org.uk
> NE1 7RU                                 twitter: phillord

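One way the "distance score" David asks for could be sketched, again in
Python: strip the parts of the page that are expected to change (the
"Generator" meta tag, HTML comments that may carry timings), split what is
left into tags and words, and compare the two token lists. Everything named
here is an assumption for illustration: the VOLATILE patterns, the function
names and the 0.2 threshold are made up, and the score is not the calibrated
"Y% chance" from the original question. Picking X for a given Y would need
measurements from real before/after pairs.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns for content expected to differ between two fetches.
VOLATILE = [
    re.compile(r'<meta[^>]*name=["\']generator["\'][^>]*>', re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),  # comments often hold timings
]


def normalise(html: str) -> list:
    """Drop volatile fragments, then split into tags and words."""
    for pattern in VOLATILE:
        html = pattern.sub("", html)
    return re.findall(r"<[^>]+>|[^<\s]+", html)


def page_distance(before: str, after: str) -> float:
    """0.0 for identical token lists, approaching 1.0 as they diverge."""
    matcher = SequenceMatcher(None, normalise(before), normalise(after),
                              autojunk=False)
    return 1.0 - matcher.ratio()


def probably_broken(before: str, after: str, threshold: float = 0.2) -> bool:
    """Flag the update when the distance exceeds an arbitrary threshold."""
    return page_distance(before, after) > threshold


if __name__ == "__main__":
    old = ('<html><head><meta name="generator" content="WordPress 3.4.1" />'
           "</head><body><p>Hello world</p></body></html>")
    new = old.replace("3.4.1", "3.4.2")
    error = "<html><body>Fatal error: out of memory</body></html>"
    print(page_distance(old, new), probably_broken(old, new))
    print(page_distance(old, error), probably_broken(old, error))
```

Comparing tokens rather than characters keeps the lists short and makes the
score less sensitive to whitespace, but a page that shows the current time
in visible text would still need its own pattern.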
