[spam-stopper] Spam-stopper Back-end

Wed Sep 21 05:58:52 UTC 2005

I have looked at the code quite thoroughly and noticed that the URLs of sites
get recorded. This introduces the interesting possibility of gathering spam
statistics, as well as the possibility of tracking blogs. To be clear, this is
not something that bothers me.

I also notice that the filter output is binary, i.e. accept or reject. A
SpamAssassin-like scoring mechanism is worth considering in the future (at the
cost of complexity). Regardless:

I am curious as to what happens at the back-end. Is it something which ought to
remain undisclosed at present?

	$comment['user_ip']    = $_SERVER['REMOTE_ADDR'];
	$comment['user_agent'] = $_SERVER['HTTP_USER_AGENT'];
	$comment['referrer']   = $_SERVER['HTTP_REFERER'];
	$comment['blog']       = get_option('home');

I am assuming that this is the most information one can gather. Whether you use
a Bayesian approach I don't know. How do you train your filters? Are you
building a neural network? Linear discriminant analysis? Support vector
machines? Boosting? All such choices can make a tremendous difference in the
long term, especially if yet another service were to compete with yours. By
changing your filter some day 'along the way' you throw away what you already
have. Stay away from lock-ins (layering, abstraction and a long testing phase
are key points). To get the fewest false positives [1], you need to make some
important choices:

- The training algorithm
- The criteria (parameters) used in training, e.g. should you even care about
the user-agent, or will it just add noise to the model? A spammer can easily
spoof it, which is not the case with IP addresses.
- Localisation or 'fragmentation' of filters, e.g. Chinese blogs will need
different filters from American ones. Mix the two and you will weaken the
model's specificity and generalisation ability.
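
To make the first and last points a little more concrete, here is a rough
sketch of a naive Bayes filter kept separately per locale. I have no idea what
the real back-end does, so everything here is an assumption on my part: the
class name, the whitespace tokeniser, the "spam"/"ham" labels, the locale keys
and the toy training data are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes over whitespace-separated tokens."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        # label is assumed to be either "spam" or "ham"
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Laplace-smoothed prior and word likelihoods, in log space
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow
        diff = log_scores["ham"] - log_scores["spam"]
        diff = max(-700.0, min(700.0, diff))
        return 1.0 / (1.0 + math.exp(diff))


# One independent model per locale (the 'fragmentation' idea)
filters = defaultdict(NaiveBayesFilter)

# Toy training data, invented for this example
filters["en"].train("cheap pills buy now", "spam")
filters["en"].train("buy cheap watches", "spam")
filters["en"].train("great post thanks", "ham")
filters["en"].train("interesting article thanks for sharing", "ham")

p_spam = filters["en"].spam_probability("buy cheap pills")
p_ham = filters["en"].spam_probability("thanks for the post")
p_untrained = filters["zh"].spam_probability("anything")
```

With this toy data the English model rates "buy cheap pills" as likely spam and
"thanks for the post" as likely genuine, while the untrained "zh" model sits at
exactly 0.5 because it has seen nothing. Keeping the raw comments and labels
around, rather than only the fitted counts, is what would let you swap the
algorithm later without throwing away what you already have.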

[1] Spaminator, for example, can really put off the user when it intercepts
(READ: does not flag or redirect) genuine contributions. False positives are a
popularity killer, especially given the /volume/ of comment spam, where one
has to eye-scan.
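
As an illustration of flagging rather than intercepting, here is a minimal
sketch of an additive score with two thresholds, loosely in the spirit of
SpamAssassin. The rules, weights and thresholds are hypothetical and chosen
only to show the shape of the idea; the "content" key is my own assumption,
while "user_agent" and "referrer" mirror the fields quoted above.

```python
# Hypothetical rules: (name, test on the comment dict, weight).
# All names and weights are invented for illustration.
RULES = [
    ("many links", lambda c: c.get("content", "").count("http") >= 3, 2.5),
    ("empty user agent", lambda c: not c.get("user_agent"), 1.0),
    ("no referrer", lambda c: not c.get("referrer"), 0.5),
]

FLAG_THRESHOLD = 2.0    # at or above this, hold the comment for moderation
REJECT_THRESHOLD = 4.0  # at or above this, reject outright


def score(comment):
    """Sum the weights of every rule the comment triggers."""
    return sum(weight for _name, test, weight in RULES if test(comment))


def decide(comment):
    """Map the score to one of three outcomes instead of accept/reject."""
    total = score(comment)
    if total >= REJECT_THRESHOLD:
        return "reject"
    if total >= FLAG_THRESHOLD:
        return "flag"
    return "accept"


genuine = {
    "content": "Nice post",
    "user_agent": "Mozilla/5.0",
    "referrer": "http://example.org/post",
}
borderline = {
    "content": "see http://a.example http://b.example http://c.example",
    "user_agent": "Mozilla/5.0",
    "referrer": "http://example.org/post",
}
blatant = {
    "content": "http://a.example http://b.example http://c.example",
    "user_agent": "",
    "referrer": "",
}
```

Here decide(genuine) gives "accept", decide(borderline) gives "flag" and
decide(blatant) gives "reject". The middle band is the point: a doubtful
comment lands in a moderation queue instead of vanishing.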

Roy

PS - This time I proofread, so I hope there will be no typos as in my previous
message.

-- 
Roy S. Schestowitz      | Useless fact: There are five regular polyhedra
http://Schestowitz.com  |    SuSE Linux    |     PGP-Key: 74572E8E
  6:55am  up 26 days 19:09,  3 users,  load average: 0.04, 0.10, 0.17