[wp-hackers] Measuring "distance" between two web pages
viniciusmassuchetto at gmail.com
Tue Sep 18 13:10:06 UTC 2012
There's a plugin, written as part of an academic work in machine learning,
that does automatic post classification based on words, verbs and pronouns.
Also, the paper (in Portuguese) describes the math expression for
calculating the distance between two posts:
You could easily extend it for HTML pages, giving a higher score to
2012/9/18 Kokarn <kokarn at gmail.com>:
> There are a couple of PHP functions for this exact thing, mainly
> levenshtein <http://php.net/manual/en/function.levenshtein.php> and
> similar_text <http://www.php.net/manual/en/function.similar-text.php>.
> I have tried to use them in the past when I was building something along
> these lines, and it's absolutely possible.
> There are a bunch of limitations, however, on how long the strings can be
> and so on, so it might not be optimal.
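For anyone who wants to experiment outside PHP, here is a rough Python sketch of the same two ideas: an edit distance and a similarity ratio. It is only an illustration, not what the PHP functions do internally. The names levenshtein() and similarity() are my own, and difflib's ratio is a different measure from similar_text's percentage, so the numbers will not match PHP's.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(similarity("<p>Hello</p>", "<p>Hullo</p>"))
```

The pure-Python distance has no length limit, but it takes time proportional to len(a) * len(b), so whole pages may be slow.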
> On 18 September 2012 14:14, Phillip Lord <phillip.lord at newcastle.ac.uk> wrote:
>> Add up all the numerical values of all the characters. It's as likely to
>> work as anything. This will, of course, fail badly under some
>> circumstances (a single character in the CSS can change things *a lot*),
>> but will work under others.
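A minimal Python sketch of the character-sum idea, just to make the trade-off concrete. The function names are made up for illustration. Reordering the same characters gives a distance of zero, so this only detects changes that alter the total.

```python
def char_sum(text: str) -> int:
    """Add up the code points of every character in the text."""
    return sum(ord(ch) for ch in text)


def char_sum_distance(before: str, after: str) -> int:
    """Absolute difference between the two character sums."""
    return abs(char_sum(before) - char_sum(after))


if __name__ == "__main__":
    # Same characters in a different order: the sums are identical.
    print(char_sum_distance("<p>Hello</p>", "<p>olleH</p>"))  # 0
```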
>> Another way would be to render the HTML to an image, then compare the
>> images. There are lots of off-the-shelf tools to do this (for instance,
>> "compare" from ImageMagick). This would check the visualisation. Setting
>> a sane threshold might be hard, but who knows till you try it.
>> Otherwise, I think you are asking about automated robot checking which
>> is going to get hard.
>> David Anderson <david at wordshell.net> writes:
>> > This is a more general question than specifically WordPress, but perhaps
>> > someone will have an idea.
>> > With WordShell (wordshell.net) we've got customers who manage enormous
>> > numbers of sites (mass hosting). They can't be bothered or can't afford to
>> > test each one individually after doing an update from 3.4.1 to 3.4.2.
>> > WordShell presently visits the site home-page before and after
>> > alterations, to test for HTTP errors. But that's rather basic.
>> > What I'd be interested in is if anyone knows of any tool or method for a
>> > generalised way of producing a score for a difference between two HTML
>> > pages, i.e. I download the HTML before, and again after - are they
>> > "probably" the same page? They can't be expected to be exactly the same;
>> > for one thing, the "Generator" meta tag will have changed. The page might
>> > show the present time, or have hidden timings in the HTML source, etc.
>> > So, what I'm interested in is a way to produce a "distance" score between
>> > the two pages. Something so that we can say something like "if the
>> > difference is more than X, then there is a Y% chance that these are not
>> > the same page". By choosing a value for Y, I can then test on X. Then
>> > WordShell will put up a flag based on that statistical likelihood - "OY!
>> > There's a good chance that that update broke the page!".
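One way this could be sketched in Python, under several assumptions of mine: volatile parts (the generator meta tag and HTML comments, which is where hidden timings often live) are stripped first, the distance is 1 minus difflib's similarity ratio over whitespace-separated tokens, and the threshold of 0.3 is an arbitrary placeholder for X that would need calibrating against real before/after pairs. None of these names or values come from WordShell.

```python
import re
from difflib import SequenceMatcher

# Assumed list of parts that legitimately differ between two fetches.
VOLATILE_PATTERNS = [
    re.compile(r'<meta\s+name=["\']generator["\'][^>]*>', re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),  # comments, e.g. cache timings
]


def normalize(html: str) -> str:
    """Drop the volatile parts and collapse runs of whitespace."""
    for pattern in VOLATILE_PATTERNS:
        html = pattern.sub("", html)
    return re.sub(r"\s+", " ", html).strip()


def page_distance(before: str, after: str) -> float:
    """0.0 means identical after normalization, 1.0 means nothing in common."""
    a = normalize(before).split()
    b = normalize(after).split()
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def probably_broken(before: str, after: str, threshold: float = 0.3) -> bool:
    """Raise the flag when the distance exceeds the (placeholder) threshold."""
    return page_distance(before, after) > threshold


if __name__ == "__main__":
    old = '<head><meta name="generator" content="WordPress 3.4.1" /></head><p>Hi there</p>'
    new = '<head><meta name="generator" content="WordPress 3.4.2" /></head><p>Hi there</p>'
    print(page_distance(old, new))  # 0.0
    print(probably_broken(old, "<b>Fatal error</b>: something went wrong"))  # True
```

Turning the threshold into an actual "Y% chance" would need a sample of known-good and known-broken updates to measure against; the sketch only gives the raw score.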
>> > This is quite interesting not just for maintaining WordPress but for any
>> > website, but I haven't come across anything like it - any ideas?
>> > Many thanks,
>> > David
>> Phillip Lord, Phone: +44 (0) 191 222 7827
>> Lecturer in Bioinformatics, Email:
>> phillip.lord at newcastle.ac.uk
>> School of Computing Science,
>> Room 914 Claremont Tower, skype: russet_apples
>> Newcastle University, msn: msn at russet.org.uk
>> NE1 7RU twitter: phillord
>> wp-hackers mailing list
>> wp-hackers at lists.automattic.com