Some mor fine tuning on bayes plugin?
-
- Regular
- Posts: 695
- Joined: Tue Jul 03, 2007 3:34 am
- Location: Berlin, Germany
- Contact:
Over the last few days, more and more HAM comments have been identified as SPAM by my Bayes installation.
Janek had the idea that this might be because, as time goes by, the DB accumulates far more SPAM than HAM entries.
Wouldn't it be a good idea to put a factor in front of the SPAM calculation that depends on the HAM/SPAM ratio in the DB, so that the filter does not become that sensitive? Something like
(HAM entries)/(SPAM entries) * (SPAM factor)
instead of (SPAM factor) only?
Re: Some mor fine tuning on bayes plugin?
Malte, if you get to work on this feature request, could you also please add a function that classifies a single submitted comment body and returns an integer rating, for the P.o.C. Dashboard? Or is it not possible to bypass the eventHook and the JS placement of the rating...?
Regards,
Ian
Serendipity Styx Edition and additional_plugins @ https://ophian.github.io/ @ https://github.com/ophian
Re: Some mor fine tuning on bayes plugin?
@grischa:
If I understand that correctly, that would be a duplication. Classic Bayes already takes the amounts into account, as the probability that something is spam (P(B)). Sure, that probability rises as more spam is classified, but I fear that the factor you described would seriously mess with the results, and not only in the intended way. Example:
Code: Select all
(1000 hams / 10000 spams) * 99 = 9.9 = ham
There would be no chance of recognizing spam.
I want to mention here that the amount of spam doesn't necessarily rise that much. Spam that lands in the recycler and is deleted there does not influence the calculation; it is not automatically learned again as spam (to prevent the effect you described). So the filter shouldn't keep getting stronger and stronger. You should watch the bottom of the database menu to see whether the amount rises; if it does, maybe that should get fixed.
Generally: each of us now has a solid test database. Maybe it would be a good idea to take some time and test different calculations in an automated way, collecting real test cases via our blogs. I probably won't have that time anytime soon (that would be a really cool project, even worth an academic paper!).
@Ian: There already is :)
function classify($comment = '', $type) returns a number, and $type can be "body", whereas $comment is simply a string.
Re: Some mor fine tuning on bayes plugin?
Oh nice. I got stuck on startClassify().
Regards,
Ian
Re: Some mor fine tuning on bayes plugin?
onli wrote: Sure, that probability raises as more spam is classified, but i fear that the factor you described would seriously mess with the results, not only in the wanted way.
Of course. As I don't have any deeper knowledge of the Bayes algorithm, this was only an example to describe what I mean, a kind of "pseudo code".
But it seems to me that comments are classified as SPAM more easily as the SPAM database fills up. Some factor based on the HAM/SPAM DB entry ratio would be cool to correct this.
Re: Some mor fine tuning on bayes plugin?
grischa wrote: But it seems to me, that comments are classified as SPAM more easy when the SPAM database is filled more and more
Did you check whether your DB is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, while the other fields have way more spam).
Basic Bayes theory is:
Code: Select all
probability that the comment is spam when it has these words = (probability so far that spam comments had these words * probability that a comment is spam) / probability that a comment has these words
Therefore, we could introduce something like this or check the current calculation, but such an operation needs thorough testing.
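For readers who prefer code to words, here is a minimal Python sketch of the rule quoted above. It is not the plugin's actual implementation; the function names and all the numbers are made up for illustration. It shows how the same word evidence yields a higher spam probability once the prior P(spam) grows.

```python
def evidence(p_words_given_spam: float, p_words_given_ham: float, p_spam: float) -> float:
    """P(words), via the law of total probability over spam and ham."""
    return p_words_given_spam * p_spam + p_words_given_ham * (1.0 - p_spam)


def spam_probability(p_words_given_spam: float, p_words_given_ham: float, p_spam: float) -> float:
    """Bayes' rule: P(spam | words) = P(words | spam) * P(spam) / P(words)."""
    p_words = evidence(p_words_given_spam, p_words_given_ham, p_spam)
    if p_words <= 0:
        raise ValueError("the words must have a non-zero probability")
    return p_words_given_spam * p_spam / p_words


# Hypothetical word likelihoods, identical in both cases.
P_WORDS_GIVEN_SPAM = 0.8
P_WORDS_GIVEN_HAM = 0.2

# Balanced database: half of all learned comments are spam.
balanced = spam_probability(P_WORDS_GIVEN_SPAM, P_WORDS_GIVEN_HAM, p_spam=0.5)

# Unbalanced database: 99% of all learned comments are spam.
unbalanced = spam_probability(P_WORDS_GIVEN_SPAM, P_WORDS_GIVEN_HAM, p_spam=0.99)

print(f"balanced prior:   {balanced:.3f}")
print(f"unbalanced prior: {unbalanced:.3f}")
```

The prior already carries the spam/ham balance of the database, which is why an extra ratio factor on top of it would count the same information twice.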
Re: Some mor fine tuning on bayes plugin?
onli wrote: Did you check whether your db is really that unbalanced? As I described, it does not necessarily have to be that way (e.g. in my blog "body" is evenly distributed, the other fields have way more spam).
Code: Select all
SELECT SUM(ham),SUM(spam)
FROM `serendipity_spamblock_bayes`
WHERE `type` = 'body'
Code: Select all
SUM(ham) 6460
SUM(spam) 5538550
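As a rough illustration only, the following Python sketch plugs these token sums into the factor proposed at the start of the thread, (HAM entries)/(SPAM entries) * (SPAM factor). The function name is hypothetical, and using the token sums as the "entries" is an assumption made here just to get a feel for the magnitudes; it is not how the plugin calculates its rating.

```python
def scaled_rating(ham_entries: int, spam_entries: int, spam_rating: float) -> float:
    """Proposed correction: (HAM entries) / (SPAM entries) * (SPAM factor)."""
    if spam_entries == 0:
        raise ValueError("no spam entries, the ratio is undefined")
    return ham_entries / spam_entries * spam_rating


# The example from earlier in the thread: a 99% spam rating drops to about 9.9.
thread_example = scaled_rating(1000, 10000, 99)

# The same 99% rating with the 'body' token sums from the query above.
body_tokens = scaled_rating(6460, 5538550, 99)

print(f"thread example: {thread_example:.2f}")
print(f"body tokens:    {body_tokens:.2f}")
```

With a database this unbalanced, even a comment rated 99% spam would end up far below any sensible threshold, which matches the objection raised against the factor.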
Re: Some mor fine tuning on bayes plugin?
Indeed, it is ^^
But those are only the tokens. The other number used to calculate P(spam) and P(ham) is shown in the database-menu at the bottom.
Re: Some mor fine tuning on bayes plugin?
It's the same (well, not that unbalanced)
Field     Valid   Spam
Name        130  24554
Homepage    126  20639
E-Mail       97   1978
IP          127   2746
Referrer    100   1571
Comment     130  11606
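To see what these counts mean for the prior, here is a small Python sketch that turns them into P(spam) per field. It assumes the prior is simply spam / (valid + spam); the thread only says these numbers are used to calculate P(spam) and P(ham), so the exact formula in the plugin may differ.

```python
# (valid, spam) counts per field, copied from the database menu above.
counts = {
    "Name": (130, 24554),
    "Homepage": (126, 20639),
    "E-Mail": (97, 1978),
    "IP": (127, 2746),
    "Referrer": (100, 1571),
    "Comment": (130, 11606),
}


def prior_spam(valid: int, spam: int) -> float:
    """Assumed prior: the share of learned entries that were marked as spam."""
    total = valid + spam
    if total == 0:
        raise ValueError("nothing learned yet for this field")
    return spam / total


priors = {field: prior_spam(valid, spam) for field, (valid, spam) in counts.items()}

for field, p in priors.items():
    print(f"{field:<9} P(spam) = {p:.4f}")
```

Under that assumption, every field starts out with a prior well above 90% spam before a single word is looked at.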
Re: Some mor fine tuning on bayes plugin?
Do you have the auto-learn function turned on, or why are there so many? Did you use to mark them as spam again when deleting them from the recycler?
Re: Some mor fine tuning on bayes plugin?
Both. I have auto-learn on, and I marked them as SPAM when clearing the recycler, back when we were still allowed to do this.
But I also receive a lot of SPAM comments, of course.
Re: Some mor fine tuning on bayes plugin?
Deactivate auto-learn; it is why that number is so high. If your filter is already sharp enough, there is no need to make it sharper.
PS: I wrote about the basics of the filter: http://www.onli-blogging.de/index.php?/ ... lagen.html