[wp-hackers] Measuring "distance" between two web pages
David Anderson
david at wordshell.net
Wed Sep 19 11:13:51 UTC 2012
On 18/09/12 23:06, wp-hackers-request at lists.automattic.com wrote:
> Add up all the numerical values of all the characters. It's as likely to
> work as anything. This will, of course, fail badly under some
> circumstances (a single character in the CSS can change things *a lot*),
> but will work under others.
Having chewed this over, I wonder if an even simpler solution is best.
Just do a total character count.
A web-page may have rolling content that changes from one view to the
next. But the total amount of content is likely to stay similar. If a
page of 60,000 characters in the source turns into a page of 20,000
after updating a plugin, then something probably went wrong.
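
As a rough sketch of that check (not something from the thread itself): the helper name `size_changed` and the 10% default threshold below are assumptions chosen for illustration. The two counts would come from something like `wget -q -O - URL | wc -c`, taken before and after the update.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the page size changed by more
# than a given percentage. The 10% default is an assumption, not a figure
# taken from this thread.
size_changed() {
    before=$1
    after=$2
    threshold=${3:-10}   # percent

    # Absolute difference, using integer arithmetic only.
    if [ "$before" -ge "$after" ]; then
        diff=$((before - after))
    else
        diff=$((after - before))
    fi

    # diff/before > threshold/100  is the same as  diff*100 > threshold*before
    [ $((diff * 100)) -gt $((threshold * before)) ]
}

# The example from the paragraph above: 60,000 characters shrinking to 20,000.
if size_changed 60000 20000; then
    echo "60000 -> 20000: suspicious"
fi

# A one-character difference should not trip the check.
if ! size_changed 49527 49526; then
    echo "49527 -> 49526: ok"
fi
```

The threshold is passed as an optional third argument, so a stricter check would look like `size_changed 20088 19369 3`.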
Quick-and-dirty test:
# for i in google.com news.bbc.co.uk/sport wordpress.org/extend/plugins \
    wordshell.net facebook.com slashdot.org cart66.com planet.wordpress.org; do
    echo "$i: `wget -q -O - http://$i | wc -c` `wget -q -O - http://$i | wc -c`"
  done
google.com: 11126 11090
news.bbc.co.uk/sport: 142052 142052
wordpress.org/extend/plugins: 18981 18981
wordshell.net: 19152 19152
facebook.com: 20088 19369
slashdot.org: 100756 100756
cart66.com: 27632 27631
planet.wordpress.org: 58082 58088
Testing on the approx. sixty sites I've got in a particular WordShell
installation:
# for i in `wordshell --listsites | grep Enabled | awk '{print $6}'`; do
    echo -n "`wget -q -O - http://$i | wc -c`/`wget -q -O - http://$i | wc -c`, "
  done
46572/46572, 8290/8290, 30088/30088, 49527/49526, 2138/2138,
31326/31326, 12672/12672, 51324/51324, 5422/5422, 19944/19944,
5201/5201, 27984/27984, 20484/20484, 17875/17875, 11434/11433,
27210/27210, 19064/19064, 16732/16732, 30075/30075, 5375/5375,
9747/9747, 6828/6828, 12687/12687, 8925/8925, 40519/40519, 2588/2588,
4757/4757, 5533/5533, 331/331, 16138/16138, 11578/11578, 13962/13962,
7636/7636, 8605/8605, 20971/20971, 7230/7230, 11583/11583, 20090/20090,
26135/26155, 5378/5378, 16532/16532, 4599/4599, 30075/30075,
20007/20007, 14565/14565, 13726/13726, 11739/11739, 57119/57119,
19335/19335, 2670/2670, 36404/36404, 58981/58981, 23257/23257,
14018/14018, 17526/17526, 19152/19152, 32893/32893, 11213/11213,
9098/9098, 12195/12195, 9394/9394, 9022/9022, 50805/50805, 15063/15063
Of course, that's not a pre/post-upgrade comparison (and some of those
sites are running caching plugins, so identical results are expected
there), but it indicates that, at least on this sample of sites, the
character count doesn't vary much from one page view to the next. I
think the maximum difference in that sample is 20 characters, or about
0.1%. The counts also vary a lot from one site to another. So I think a
large change in character count before and after an update can be a
good predictor of breakage.
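
To make the "20 characters, or about 0.1%" estimate checkable, here is a small sketch that finds the largest difference in a list of first/second pairs like the output above. The function name `max_diff` is hypothetical, and the sample list holds only a few of the pairs from the run, including the ones that differ.

```shell
#!/bin/sh
# A handful of pairs copied from the output above, including those that differ.
pairs="46572/46572, 49527/49526, 11434/11433, 26135/26155, 15063/15063"

# Hypothetical helper: print the largest absolute difference between the two
# counts of any pair, and that difference as a percentage of the first count.
max_diff() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F/ '
        NF == 2 {
            d = $1 - $2
            if (d < 0) d = -d
            if (d > max) { max = d; base = $1 + 0 }
        }
        END {
            if (base > 0) printf "%d %.2f\n", max, 100 * max / base
            else print "0 0.00"
        }'
}

max_diff "$pairs"    # prints: 20 0.08
```

The 20-character difference comes from the 26135/26155 pair, which works out to roughly 0.08% of the page size.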
David
--
WordShell - WordPress fast from the CLI - www.wordshell.net