[wp-hackers] Measuring "distance" between two web pages

David Anderson david at wordshell.net
Wed Sep 19 11:13:51 UTC 2012


On 18/09/12 23:06, wp-hackers-request at lists.automattic.com wrote:
> Add up all the numerical values of all the characters. It's as likely to
> work as anything. This will, of course, fail badly under some
> circumstances (a single character in the CSS can change things *a lot*),
> but will work under others.
Having chewed this over, I wonder if an even simpler solution is best. 
Just do a total character count.

A web page may have rolling content that changes from one view to the 
next, but the total amount of content is likely to stay similar. If a 
page of 60,000 characters in the source turns into a page of 20,000 
after updating a plugin, then something probably went wrong.
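
As a rough sketch of that check (the function name and the 25% default 
threshold are my own assumptions for illustration, not anything tested 
in this thread), the before/after comparison could look something like 
this:

```shell
#!/bin/sh
# Hypothetical helper: returns success (exit 0) when the page size changed
# by more than max_pct percent relative to the "before" count.
size_changed_too_much() {
    before=$1
    after=$2
    max_pct=${3:-25}   # assumed default threshold, purely illustrative

    # Absolute difference, using integer arithmetic only
    if [ "$after" -ge "$before" ]; then
        diff=$((after - before))
    else
        diff=$((before - after))
    fi

    # diff/before > max_pct/100 is equivalent to diff*100 > before*max_pct
    [ $((diff * 100)) -gt $((before * max_pct)) ]
}

# Usage sketch (example.com is a placeholder URL):
# before=$(wget -q -O - http://example.com/ | wc -c)
# ... update the plugin ...
# after=$(wget -q -O - http://example.com/ | wc -c)
# if size_changed_too_much "$before" "$after"; then
#     echo "possible breakage"
# fi
```

With these assumed numbers, 60000 -> 20000 is flagged and a drift of a 
few dozen bytes is not. Note that "wc -c" counts bytes, which matches 
the character count only for single-byte content; for a size comparison 
like this one that should not matter.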

Quick-and-dirty test:

# for i in google.com news.bbc.co.uk/sport wordpress.org/extend/plugins \
    wordshell.net facebook.com slashdot.org cart66.com planet.wordpress.org; \
    do echo "$i: `wget -q -O - http://$i | wc -c` `wget -q -O - http://$i | wc -c`"; done

google.com: 11126 11090
news.bbc.co.uk/sport: 142052 142052
wordpress.org/extend/plugins: 18981 18981
wordshell.net: 19152 19152
facebook.com: 20088 19369
slashdot.org: 100756 100756
cart66.com: 27632 27631
planet.wordpress.org: 58082 58088

Testing on the approx. sixty sites I've got in a particular WordShell 
installation:

# for i in `wordshell --listsites | grep Enabled | awk '{print $6}'`; do \
    echo -n "`wget -q -O - http://$i | wc -c`/`wget -q -O - http://$i | wc -c`, "; done

46572/46572, 8290/8290, 30088/30088, 49527/49526, 2138/2138, 
31326/31326, 12672/12672, 51324/51324, 5422/5422, 19944/19944, 
5201/5201, 27984/27984, 20484/20484, 17875/17875, 11434/11433, 
27210/27210, 19064/19064, 16732/16732, 30075/30075, 5375/5375, 
9747/9747, 6828/6828, 12687/12687, 8925/8925, 40519/40519, 2588/2588, 
4757/4757, 5533/5533, 331/331, 16138/16138, 11578/11578, 13962/13962, 
7636/7636, 8605/8605, 20971/20971, 7230/7230, 11583/11583, 20090/20090, 
26135/26155, 5378/5378, 16532/16532, 4599/4599, 30075/30075, 
20007/20007, 14565/14565, 13726/13726, 11739/11739, 57119/57119, 
19335/19335, 2670/2670, 36404/36404, 58981/58981, 23257/23257, 
14018/14018, 17526/17526, 19152/19152, 32893/32893, 11213/11213, 
9098/9098, 12195/12195, 9394/9394, 9022/9022, 50805/50805, 15063/15063

Of course, that's not pre/post upgrade (and some of those sites are 
running caching plugins, so there we would expect identical results), 
but it indicates that, at least on this sample of sites, the character 
count does not vary much from one page view to the next. The largest 
difference in the sixty-site list is 20 characters (26135 vs. 26155), 
or about 0.1%. The counts also vary a lot from one site to another. So 
I think larger changes in character count before and after a change can 
be good predictors of breakage.
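
The largest view-to-view difference can be pulled out of the 
"first/second" list mechanically, which gives a feel for how far above 
normal jitter any threshold needs to sit. A minimal sketch, where the 
function name is hypothetical and the sample pairs are copied from the 
output above:

```shell
#!/bin/sh
# Hypothetical helper: reads comma-separated "first/second" count pairs on
# stdin and prints: max_difference first second percent_of_first
max_pair_diff() {
    tr ',' '\n' | awk -F/ '
        NF == 2 {
            d = $1 - $2
            if (d < 0) d = -d              # absolute difference
            if (d > max) { max = d; a = $1 + 0; b = $2 + 0 }
        }
        END { printf "%d %d %d %.2f%%\n", max, a, b, (a ? 100 * max / a : 0) }
    '
}

# A few pairs taken from the list above
echo "49527/49526, 11434/11433, 26135/26155, 50805/50805" | max_pair_diff
# prints: 20 26135 26155 0.08%
```

The whole wrapped list can be piped in as-is, since blank or 
whitespace-only fragments left over from line wrapping are skipped.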

David

-- 
WordShell - WordPress fast from the CLI - www.wordshell.net


