[wp-hackers] Measuring "distance" between two web pages

Vinicius Massuchetto viniciusmassuchetto at gmail.com
Tue Sep 18 13:10:06 UTC 2012


There's a plugin, written for an academic machine learning project, that
auto-classifies posts based on histograms of their words, verbs and
pronouns:
https://bitbucket.org/viniciusmassuchetto/bcc-ufpr/src/94d8da2661ec/ci310-tam/trabfinal/self-classifier

The accompanying paper (in Portuguese) gives the mathematical expression
used to calculate the distance between two posts:
https://bitbucket.org/viniciusmassuchetto/bcc-ufpr/src/94d8da2661ec/ci310-tam/trabfinal/doc/relatorio.pdf

You could easily extend it to HTML pages, giving a higher score to
matching tags.
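
A minimal sketch of that idea in Python, under some loud assumptions: this
is not the plugin's code and not the formula from the paper. It builds one
histogram of words plus tag names, counts each tag TAG_WEIGHT times (a
made-up value), and uses cosine distance as a stand-in metric. The names
histogram, cosine_distance and page_distance are hypothetical.

```python
import math
import re
from collections import Counter

TAG_WEIGHT = 3.0  # assumption: how much more a tag counts than a word


def histogram(html: str, tag_weight: float = TAG_WEIGHT) -> Counter:
    """Count words and opening-tag names; each tag adds tag_weight."""
    hist = Counter()
    # Opening tag names only, e.g. '<div class="x">' is counted as '<div>'.
    for tag in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html):
        hist["<" + tag.lower() + ">"] += tag_weight
    # Strip every tag, then count the remaining words (letters only).
    text = re.sub(r"<[^>]*>", " ", html)
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        hist[word] += 1.0
    return hist


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 0.0 for identical histograms, 1.0 for disjoint ones."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Two empty pages are "the same"; empty vs. non-empty is not.
        return 0.0 if norm_a == norm_b else 1.0
    # Clamp to guard against tiny negative values from rounding.
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


def page_distance(before_html: str, after_html: str) -> float:
    """Distance between two pages based on their weighted histograms."""
    return cosine_distance(histogram(before_html), histogram(after_html))
```

Because tags weigh more than words, a page that keeps its markup but
changes a few words stays close to the original, while a page replaced by
a bare error message (no shared tags) ends up far away.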

Cheers.
--
Vinicius Massuchetto
http://vinicius.soylocoporti.org.br


2012/9/18 Kokarn <kokarn at gmail.com>:
> There are a couple of PHP functions for this exact thing, mainly the
> levenshtein <http://php.net/manual/en/function.levenshtein.php> and
> similar_text <http://www.php.net/manual/en/function.similar-text.php>
> functions.
>
> I have tried in the past to use them when I was building something
> along these lines, and it's absolutely possible.
>
> There are a number of limitations, however, such as how long the strings
> can be, so it might not be optimal.
>
> On 18 September 2012 14:14, Phillip Lord <phillip.lord at newcastle.ac.uk> wrote:
>
>>
>>
>> Add up all the numerical values of all the characters. It's as likely to
>> work as anything. This will, of course, fail badly under some
>> circumstances (a single character in the CSS can change things *a lot*),
>> but will work under others.
>>
>> Another way would be to render the HTML to an image, then compare the
>> images. There are lots of off-the-shelf tools to do this (for instance,
>> "compare" from ImageMagick). This would check the rendered appearance.
>> Setting a sane threshold might be hard, but who knows till you try it.
>>
>> Otherwise, I think you are asking about automated robot checking which
>> is going to get hard.
>>
>> Phil
>>
>>
>> David Anderson <david at wordshell.net> writes:
>> > This is a more general question than specifically WordPress, but
>> > perhaps someone will have an idea.
>> >
>> > With WordShell (wordshell.net) we've got customers who manage
>> > enormous numbers of sites (mass hosting). They can't be bothered or
>> > can't afford to test each one individually after doing an update
>> > from 3.4.1 to 3.4.2.
>> >
>> > WordShell presently visits the site home-page before and after
>> > alterations, to test for HTTP errors. But that's rather basic.
>> >
>> > What I'd be interested in is if anyone knows of any tool or method
>> > for a generalised way of producing a score for a difference between
>> > two HTML pages. i.e. I download the HTML before, and again after -
>> > are they "probably" the same page? They can't be expected to be
>> > exactly the same; for one thing, the "Generator" meta tag will have
>> > changed. The page might show the present time, or have hidden
>> > timings in the HTML source, etc.
>> >
>> > So, what I'm interested in is a way to produce a "distance" score
>> > between the two pages. Something so that we can say something like
>> > "if the difference is more than X, then there is a Y% chance that
>> > these are not the same page". By choosing for Y, I can then test on
>> > X. Then WordShell will put up a flag based on that statistical
>> > likelihood - "OY! There's a good chance that that update broke the
>> > page!".
>> >
>> > This is quite interesting not just for maintaining WordPress but for
>> > any website, but I haven't come across anything like it - any ideas?
>> >
>> > Many thanks,
>> > David
>>
>> --
>> Phillip Lord,                           Phone: +44 (0) 191 222 7827
>> Lecturer in Bioinformatics,             Email: phillip.lord at newcastle.ac.uk
>> School of Computing Science,
>> http://homepages.cs.ncl.ac.uk/phillip.lord
>> Room 914 Claremont Tower,               skype: russet_apples
>> Newcastle University,                   msn: msn at russet.org.uk
>> NE1 7RU                                 twitter: phillord
>> _______________________________________________
>> wp-hackers mailing list
>> wp-hackers at lists.automattic.com
>> http://lists.automattic.com/mailman/listinfo/wp-hackers
>>

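The string-similarity suggestion quoted above (levenshtein / similar_text)
can be sketched in Python roughly as follows. This is only an illustration
under stated assumptions: it uses difflib.SequenceMatcher from the standard
library, which is not Levenshtein distance but gives a comparable
similarity ratio without a short string-length limit. The list of
"volatile" patterns (the generator meta tag and HTML comments, where hidden
timings often live), the 0.8 threshold and the function names are all
hypothetical and would need tuning against real sites.

```python
import re
from difflib import SequenceMatcher

# Assumption: content expected to differ between two fetches of a healthy
# page. Real sites will need more patterns (timestamps, nonces, ...).
VOLATILE = [
    re.compile(r'<meta\s+name=["\']generator["\'][^>]*>', re.IGNORECASE),
    re.compile(r"<!--.*?-->", re.DOTALL),  # e.g. hidden timing comments
]


def normalize(html: str) -> str:
    """Drop volatile fragments and collapse runs of whitespace."""
    for pattern in VOLATILE:
        html = pattern.sub("", html)
    return re.sub(r"\s+", " ", html).strip()


def similarity(before: str, after: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = SequenceMatcher(
        None, normalize(before), normalize(after), autojunk=False
    )
    return matcher.ratio()


def probably_broken(before: str, after: str, threshold: float = 0.8) -> bool:
    """Flag the update when similarity falls below the (assumed) threshold."""
    return similarity(before, after) < threshold
```

The "distance" asked for in the original question would then be
1 - similarity(before, after). Picking the threshold is the hard part: one
option is to fetch the same unchanged page several times, look at how much
the score varies on its own, and set the cut-off below that range.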
