Some more fine tuning on bayes plugin?

Creating and modifying plugins.
blog.brockha.us
Regular
Posts: 695
Joined: Tue Jul 03, 2007 3:34 am
Location: Berlin, Germany

Post by blog.brockha.us »

Over the last few days, more and more HAM comments have been identified as SPAM by my Bayes installation.
Janek had the idea that this might be because, as time goes by, the DB accumulates a lot more SPAM than HAM entries.

Wouldn't it be a good idea to put a factor in front of the SPAM calculation that depends on the SPAM/HAM ratio in the DB, so that the filter does not become so sensitive? Something like
(HAM entries)/(SPAM entries) * (SPAM factor)
instead of (SPAM factor) alone?
- Grischa Brockhaus - http://blog.brockha.us
- Want to make me happy? http://wishes.brockha.us/
Timbalu
Regular
Posts: 4598
Joined: Sun May 02, 2004 3:04 pm

Post by Timbalu »

Malte, if you get to work on this feature request, could you please also add a function that classifies a single submitted comment body and returns an int rating, for the P.o.C. Dashboard? Or is it not possible to bypass the eventHook and JS placement of the rating...?
Regards,
Ian

Serendipity Styx Edition and additional_plugins @ https://ophian.github.io/ @ https://github.com/ophian
onli
Regular
Posts: 2825
Joined: Tue Sep 09, 2008 10:04 pm

Post by onli »

@grischa:
If I understand that correctly, that would be a duplication. Classic Bayes already takes the amounts into account, as the probability that something is spam (P(B)). Sure, that probability rises as more spam is classified, but I fear that the factor you described would seriously mess with the results, and not only in the intended way. Example:

Code: Select all

(1000 hams /10000 spams) * 99 = 9.9 = ham
There would be no chance of recognizing spam.
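
To make that arithmetic concrete, here is a minimal Python sketch. It assumes a spam rating on a 0 to 100 scale and a cutoff of 50; both are assumptions for illustration, not values taken from the plugin, and `scaled_rating` is a hypothetical helper that does not exist in the Bayes plugin.

```python
def scaled_rating(spam_rating, ham_entries, spam_entries):
    """Apply the proposed correction: (HAM entries / SPAM entries) * rating."""
    if spam_entries == 0:
        raise ValueError("spam_entries must be > 0")
    return ham_entries / spam_entries * spam_rating


# Hypothetical cutoff on a 0-100 scale; not taken from the plugin.
SPAM_CUTOFF = 50.0

# The example from the post: 1000 hams, 10000 spams, raw rating 99.
rating = scaled_rating(99, ham_entries=1000, spam_entries=10000)
print(round(rating, 1))        # the scaled rating from the example
print(rating >= SPAM_CUTOFF)   # whether it would still count as spam

# With a 1:10 ratio, even a raw rating of 100 is scaled down to a tenth.
max_rating = scaled_rating(100, ham_entries=1000, spam_entries=10000)
print(round(max_rating, 1))
```

With a ham/spam ratio of 1:10, no raw rating can get past the assumed cutoff, which is the problem described here.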

I want to mention here that the amount of spam doesn't necessarily rise that much. Spam that lands in the recycler and is deleted there does not influence the calculation; it is not automatically learned again as spam (to prevent the effect you described). So the filter shouldn't keep getting stronger and stronger. You should watch the bottom of the database menu to see whether the amount rises; if it does, maybe that should be fixed.

In general: each of us now has a solid test database. Maybe it would be a good idea to take some time and test different calculations in an automated way, collecting real test cases via our blogs. I probably won't have the time for that anytime soon (it would be a really cool project, even worth an academic paper!).

@Ian: There already is :)
function classify($comment = '', $type) returns a number; $type can be "body", and $comment is simply a string.

Post by Timbalu »

Oh nice. I got stuck on startClassify().
Regards,
Ian


Post by blog.brockha.us »

onli wrote: Sure, that probability rises as more spam is classified, but I fear that the factor you described would seriously mess with the results, and not only in the intended way.
Of course: as I don't have any deeper knowledge of the Bayes algorithm, this was only an example to describe what I mean, a kind of "pseudo code".
But it seems to me that comments are classified as SPAM more easily as the SPAM database fills up. Some factor based on the HAM/SPAM ratio of the DB entries would be a nice way to correct this.

Post by onli »

blog.brockha.us wrote: But it seems to me, that comments are classified as SPAM more easy when the SPAM database is filled more and more
Did you check whether your DB is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, while the other fields have far more spam).

Basic Bayes theory is:

Code: Select all

probability that the comment is spam, given that it has these words = (the probability so far that spam comments had these words * the probability that a comment is spam) / the probability that a comment has these words
You can surely see that the total amounts of spam and ham comments are used heavily in this calculation. The overall distribution of spam and ham is especially important. Messing with that would go against the logic of the classifier (since you clearly get more spam than ham, the a priori probability of the comment classification should be biased in that direction). But it is true that such an artificial bias should be, and normally is, introduced if one wants to make sure not to misclassify a comment as spam (in the Bayes plugin's calculation, such factors were taken over blindly from the original code, b8).
Therefore, we could introduce something like this or review the current calculation, but such a change needs thorough testing.
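
The rule above can be written out in a few lines of Python. This is only a sketch of textbook Bayes, not the actual calculation of the plugin or of b8; the function name and all numbers are made up for illustration.

```python
def p_spam_given_words(p_words_given_spam, p_words_given_ham, p_spam):
    """Bayes' rule: P(spam | words) = P(words | spam) * P(spam) / P(words)."""
    p_ham = 1.0 - p_spam
    # P(words) via the law of total probability over spam and ham.
    p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
    if p_words == 0:
        raise ValueError("these words were never seen in spam or ham")
    return p_words_given_spam * p_spam / p_words


# Illustrative numbers: the words are equally common in spam and ham,
# so only the prior P(spam) decides the result.
balanced = p_spam_given_words(0.2, 0.2, p_spam=0.5)
skewed = p_spam_given_words(0.2, 0.2, p_spam=0.99)
print(balanced, skewed)
```

When the words themselves carry no information, the result simply equals the prior, so a database that is 99% spam pushes such a comment towards spam. This is the bias being discussed here.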

Post by blog.brockha.us »

onli wrote: Did you check whether your DB is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, while the other fields have far more spam).

Code: Select all

SELECT SUM(ham),SUM(spam)
FROM `serendipity_spamblock_bayes`
WHERE `type` = 'body'

Code: Select all

SUM(ham)   6460
SUM(spam) 5538550
... I am sure that it really is that unbalanced ... :mrgreen:

Post by onli »

Indeed, it is ^^
But those are only the tokens. The other number, which is used to calculate P(spam) and P(ham), is shown at the bottom of the database menu.

Post by blog.brockha.us »

It's the same (well, not that unbalanced)

Field      Valid   Spam
Name         130  24554
Homepage     126  20639
E-Mail        97   1978
IP           127   2746
Referrer     100   1571
Comment      130  11606
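
For a sense of scale, the spam share implied by these counts can be worked out directly. The sketch below assumes, as a simplification, that P(spam) for a field is just spam / (valid + spam); the plugin's real formula may differ, and `spam_prior` is a hypothetical helper.

```python
# Learned counts per field as (valid, spam), copied from the numbers above.
counts = {
    "Name": (130, 24554),
    "Homepage": (126, 20639),
    "E-Mail": (97, 1978),
    "IP": (127, 2746),
    "Referrer": (100, 1571),
    "Comment": (130, 11606),  # "Kommentar" in the German admin output
}


def spam_prior(valid, spam):
    """Assumed prior: the share of learned entries that are spam."""
    return spam / (valid + spam)


priors = {field: spam_prior(v, s) for field, (v, s) in counts.items()}
for field, prior in priors.items():
    print(f"{field:<9} P(spam) ~ {prior:.3f}")
```

Under this assumption, every field ends up with a spam share above 0.9.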

Post by onli »

Do you have the auto-learn function on, or why are there so many? Did you use to mark them as spam again when deleting them from the recycler?

Post by blog.brockha.us »

Both. I have auto-learn on, and I marked them as SPAM when clearing the recycler, back when we were still allowed to do this.

But I also receive a lot of SPAM comments, of course. :)

Post by onli »

Deactivate auto-learn; it is why that number is so high. If your filter is already sharp enough, there is no need to make it sharper.

PS: I wrote about the basics of the filter: http://www.onli-blogging.de/index.php?/ ... lagen.html