
Some more fine tuning on bayes plugin?

Posted: Wed Jun 20, 2012 1:45 pm
by blog.brockha.us
Over the last few days, more and more HAM comments have been identified as SPAM by my Bayes installation.
Janek had the idea that this might happen because, as time goes by, the DB accumulates far more SPAM than HAM entries.

Wouldn't it be a good idea to put a factor in front of the SPAM calculation that depends on the HAM/SPAM ratio in the DB, so that the filter does not become so sensitive? Something like
(HAM entries)/(SPAM entries) * (SPAM factor)
instead of (SPAM factor) only?

Re: Some more fine tuning on bayes plugin?

Posted: Thu Jun 21, 2012 6:06 pm
by Timbalu
Malte, if you get to work on this feature request, could you also please add a function that classifies a single submitted comment body and returns an int rating, for the P.o.C. Dashboard? Or is it not possible to bypass the eventHook and the JS placement of the rating...?

Re: Some more fine tuning on bayes plugin?

Posted: Thu Jun 21, 2012 10:43 pm
by onli
@grischa:
If I understand that correctly, that would be a duplication. Classic Bayes already takes the amounts into account, as the probability that something is spam (P(B)). Sure, that probability rises as more spam is classified, but I fear that the factor you described would seriously mess with the results, and not only in the intended way. Example:

Code: Select all

(1000 hams /10000 spams) * 99 = 9.9 = ham
There would be no chance of recognizing spam.
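
To make that arithmetic concrete, here is a minimal Python sketch. It is not the plugin's code: the function name `scaled_spam_score` and the assumption that the raw rating runs from 0 to 100 are purely illustrative.

```python
def scaled_spam_score(raw_score, ham_entries, spam_entries):
    """Apply the proposed (HAM entries)/(SPAM entries) factor to a raw spam rating.

    Hypothetical helper: raw_score is assumed to be a 0-100 spam rating.
    """
    if spam_entries <= 0:
        raise ValueError("spam_entries must be positive")
    return (ham_entries / spam_entries) * raw_score


# The example from the post: a comment rated 99, i.e. almost certainly spam.
score = scaled_spam_score(99, ham_entries=1000, spam_entries=10000)
print(round(score, 1))

# Even the highest possible raw rating cannot get past 10 with this DB.
ceiling = scaled_spam_score(100, ham_entries=1000, spam_entries=10000)
print(round(ceiling, 1))
```

With ten times more spam than ham in the DB, the factor is 0.1, so every rating is pushed into the ham range regardless of the comment's content.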

I want to mention here that the amount of spam doesn't necessarily rise that much. Spam that lands in the recycler and is deleted there does not influence the calculation; it is not automatically learned again as spam (to prevent the effect you described). So the filter shouldn't keep getting stronger and stronger. You should watch the bottom of the database menu to see whether the amount rises; if it does, maybe that should get fixed.

In general: each of us now has a solid test database. Maybe it would be a good idea to take some time and test different calculations in an automated way, collecting real test cases via our blogs. I probably won't have the time for that anytime soon (though it would be a really cool project, even worth an academic paper!).

@Ian: There already is :)
function classify($comment = '', $type) returns a number; $type can be "body", and $comment is simply a string.

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 8:51 am
by Timbalu
Oh nice. I got stuck on startClassify().

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 12:09 pm
by blog.brockha.us
onli wrote: Sure, that probability rises as more spam is classified, but I fear that the factor you described would seriously mess with the results, and not only in the intended way.
Of course: as I don't have any deeper knowledge of the Bayes algorithm, this was only an example to describe what I mean. Some kind of "pseudo code".
But it seems to me that comments are classified as SPAM more easily as the SPAM database fills up more and more. Some factor based on the HAM/SPAM DB entry ratio would be cool to correct this.

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 1:33 pm
by onli
blog.brockha.us wrote: But it seems to me that comments are classified as SPAM more easily as the SPAM database fills up more and more.
Did you check whether your DB is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, while the other fields have way more spam).

Basic Bayes theory is:

Code: Select all

P(comment is spam | it has these words) = (P(it has these words | comment is spam), as observed so far, * P(a comment is spam)) / P(a comment has these words)
You can surely see that this calculation relies heavily on the total number of spam and ham comments; the overall distribution of spam and ham is especially important. Messing with it would go against the logic of the classifier (as you clearly get more spam than ham, the a priori probability of the comment classification should be biased in that direction). But it is true that an artificial bias should be, and normally is, introduced if one wants to avoid misclassifying a legitimate comment as spam (in the bayes plugin's calculation, such factors were taken over blindly from the original code, b8).
Therefore, we could introduce something like this, or review the current calculation, but such a change needs thorough testing.
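
A minimal Python sketch of the rule for a single token may help show how the totals enter the result. This is textbook Bayes under simplifying assumptions, not the b8 calculation the plugin actually uses; the function name and all counts are made up for illustration.

```python
def p_spam_given_token(token_in_spam, token_in_ham, spam_total, ham_total):
    """Textbook Bayes for one token: P(spam | token).

    token_in_spam / token_in_ham: learned comments of each kind containing the token.
    spam_total / ham_total: all learned comments of each kind (hypothetical inputs).
    """
    if spam_total <= 0 or ham_total <= 0:
        raise ValueError("both totals must be positive")
    total = spam_total + ham_total
    p_spam = spam_total / total                      # prior P(spam)
    p_ham = ham_total / total                        # prior P(ham)
    p_token_given_spam = token_in_spam / spam_total  # likelihood in spam
    p_token_given_ham = token_in_ham / ham_total     # likelihood in ham
    # P(token) via the law of total probability
    p_token = p_token_given_spam * p_spam + p_token_given_ham * p_ham
    if p_token == 0:
        raise ValueError("token was never seen")
    return p_token_given_spam * p_spam / p_token


# A token that shows up in 10% of spam and 10% of ham says nothing by itself.
balanced = p_spam_given_token(10, 10, spam_total=100, ham_total=100)
skewed = p_spam_given_token(100, 10, spam_total=1000, ham_total=100)
print(round(balanced, 3), round(skewed, 3))
```

The token is equally common in both classes in both calls, yet the second result is much higher: the prior alone pulls it toward spam once the learned spam outnumbers the ham ten to one.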

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 1:43 pm
by blog.brockha.us
onli wrote: Did you check whether your DB is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, while the other fields have way more spam).

Code: Select all

SELECT SUM(ham),SUM(spam)
FROM `serendipity_spamblock_bayes`
WHERE `type` = 'body'

Code: Select all

SUM(ham)   6460
SUM(spam) 5538550
... I am sure that it really is that unbalanced ... :mrgreen:

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 1:58 pm
by onli
Indeed, it is ^^
But those are only the tokens. The other number used to calculate P(spam) and P(ham) is shown in the database-menu at the bottom.

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 2:00 pm
by blog.brockha.us
It's the same (well, not that unbalanced)

Field      Valid    Spam
Name         130   24554
Homepage     126   20639
E-Mail        97    1978
IP           127    2746
Referrer     100    1571
Comment      130   11606
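
Taking the learned-comment counts above as input, a short Python sketch can show what prior they imply per field. It assumes the prior is simply spam / (valid + spam); whether the plugin computes it exactly this way is an assumption, and the names `learned` and `spam_prior` are hypothetical.

```python
# Learned comments per field as (valid, spam), copied from the figures above.
learned = {
    "Name": (130, 24554),
    "Homepage": (126, 20639),
    "E-Mail": (97, 1978),
    "IP": (127, 2746),
    "Referrer": (100, 1571),
    "Comment": (130, 11606),
}


def spam_prior(valid, spam):
    """Assumed prior: share of learned comments that were marked as spam."""
    if valid + spam == 0:
        raise ValueError("no learned comments")
    return spam / (valid + spam)


for field, (valid, spam) in learned.items():
    print(f"{field:<9} P(spam) = {spam_prior(valid, spam):.4f}")
```

Under that assumption every field ends up with a spam prior above 0.9, so a comment starts out heavily tilted toward spam before any of its tokens are looked at.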

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 5:05 pm
by onli
Do you have the auto-learn function on, or why are there so many? Did you use to mark them as spam again when deleting them from the recycler?

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 5:11 pm
by blog.brockha.us
Both. I have auto-learn on, and I marked them as SPAM when clearing the recycler back when we were still allowed to do this.

But I also receive a lot of SPAM comments, of course. :)

Re: Some more fine tuning on bayes plugin?

Posted: Fri Jun 22, 2012 5:31 pm
by onli
Deactivate auto-learn; it is the reason that number is so high. If your filter is already sharp enough, there is no need to make it sharper.

PS: I wrote about the basics of the filter: http://www.onli-blogging.de/index.php?/ ... lagen.html