
Spam Hurdles Module (CAPTCHA's and other anti-spam tools)

Posted by Maurice Makaay 
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 27, 2007 11:21AM
I installed the new version (database instead of files) of the Spam Hurdles Module and it was perfect, like every mmakaay module: clean, easy, one-click install and... working at once. That is the good side of the module. The bad side is that, while we avoid the thousands and thousands of files generated by the previous version, we simply transfer the problem to the database. Of course, it is much easier to manage the database than a profusion of files and folders.

After two days of running, the table size stabilized at around 800 records taking 25 MB of space. That's too much, even though I don't have to worry about it since I have enough space on the disk. For example, the forum_messages table, with 24,000 records, takes a little less space (24 MB) than the forum_spamhurdles table. Of course there is no relation between the two, but it is not logical that a protection system takes more space than the thing it is supposed to protect. I think too much information is stored. I don't know whether all of it is necessary, but looking at it, I see that a lot of it is redundant. Perhaps another schema for the records could avoid this redundancy.

The other problem is that since I installed the new version, I get "too many connections" errors from MySQL every morning, and of course my MySQL server is down. I think this comes from spiders scanning the forums: MySQL can't serve them as quickly as it should, because on each spider visit it not only has to read the message data but also write to the spamhurdles table. This gives my server too much work.

Best regards, and I hope this comment is useful for the next revision of the module.
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 29, 2007 10:24AM
As for the number of connections: the module does not open any connections. It just uses the connection that is already opened by Phorum for doing all other queries. So to me it sounds a bit strange that the module would give you problems with the connection limit. Why is your mysql server down because of the number of connections? (I don't get the "of course" in that statement).

There is indeed a lot of data cached, and for a reason: this way, the captchas don't have to be generated on every request. Only the first request has to load and run the modules that generate the needed info; subsequent requests use the cached data. If you think too much data is stored, the first thing to do is to tweak the module so it keeps the data in the database for a shorter time. You can read back in this thread for information on how to do this. The most useful parameter to tweak is the TTL for the spamhurdles data, which can be found in the defaults file.
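
To illustrate why lowering the TTL shrinks the table, here is a minimal sketch of a TTL-based cache in Python. It is not the module's code: the class name HurdleCache, its methods, and the in-memory dict standing in for the spamhurdles table are all assumptions made for the example.

```python
import time


class HurdleCache:
    """Toy stand-in for the spamhurdles table: key -> (stored_at, data)."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.rows = {}

    def put(self, key, data, now=None):
        # Store the data together with the time it was written.
        now = time.time() if now is None else now
        self.rows[key] = (now, data)

    def get(self, key, now=None):
        # Return cached data, or None if it is missing or older than the TTL.
        now = time.time() if now is None else now
        row = self.rows.get(key)
        if row is None or now - row[0] > self.ttl_seconds:
            return None
        return row[1]

    def purge(self, now=None):
        # Delete every row older than the TTL; return how many were removed.
        now = time.time() if now is None else now
        expired = [k for k, (stored_at, _) in self.rows.items()
                   if now - stored_at > self.ttl_seconds]
        for k in expired:
            del self.rows[k]
        return len(expired)
```

With a shorter ttl_seconds, each purge removes more rows, so the number of stored records at any moment is roughly the request rate multiplied by the TTL.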


Maurice Makaay
Phorum Development Team
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 29, 2007 11:29PM
a properly tuned mysql server won't have any trouble with this at all...

you can get tuning-primer.sh from [www.day32.com]

read through the script so you know what it's doing, then, if you're comfortable with it, run the script.

it gives some very useful tuning information that will help you a lot.


the data maintained by this module for a given user should be cached... the only reason it would cause slowness is if you don't have enough cache and every db read/write has to access the disk...
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 31, 2007 10:17AM
Quote
Makaay
(I don't get the "of course" in that statement).
Never mind that. It was an "of course" day; I used it at least three times in my message. But curious as it sounds, I'm not one of those who think things go without saying, of course :)

All I wanted to say is that the module gives the server a lot of work, especially when the forums are visited by spiders. This doesn't mean that it is responsible for bringing MySQL down. I have a lot of other problems and Spam Hurdles just adds one more (thanks freedman for the tip). Anyway, it is much better this way than it was with the files version, at least from the user's point of view. Perhaps from the technical side the file system is better; I'm not able to compare.

So what could be improved?

There are two problems in my opinion:
- Spider activity. Do spiders create captchas? I think so, though I'm not sure. Could Maurice confirm that? If so, a solution like the one the oscommerce.com guys came up with should be possible. The problem there was that session IDs (SIDs) generated for spiders were added to the indexed URLs, which caused a lot of problems. So they test against a list of spiders (a text file), and if the HTTP referer is a spider, no session ID is created.
- Redundant data. I have no idea how to reduce it. Is it possible to avoid storing the entire HTML/JavaScript code? That should at least save a few bytes. What is the long JavaScript-encoded text in the database records? Is it the captcha's picture? What is the logic? A captcha for every message (or thread?) for every visitor?

Have a nice day.
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 31, 2007 10:44AM
Some short answers:

* Yes, spiders generate captcha's. If somebody has a trusted source for spider IP-addresses, then adding functionality for not running spam hurdles for those is easy.

* Anything is possible, including not storing redundant data.

* The javascript encoded text is not an image. Images are generated outside. To see what the encoded data is, you will have to read the module's code.

* The logic is that a captcha is generated whenever an editor is visible on screen. The captcha is linked to that editor. So yes, people clicking through a lot of messages will generate a lot of captchas. I have been thinking of linking captcha's to visitors and reuse them if the go to another page without posting a message, but that's only a faint idea.


Maurice Makaay
Phorum Development Team
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 31, 2007 08:35PM
Quote
mmakaay
I have been thinking of linking captcha's to visitors and reuse them if the go to another page without posting a message, but that's only a faint idea.
this would be terrible...
some of the spammers hire people in areas of the world where labor is cheap and human beings do the spamming.. more sophisticated ones just present a screen with the captcha so a person types along and then the autobot finishes up.

if it's only to save database temporary storage, you could, theoretically, generate, say, 100 captcha's, and if any are older than some time-limit, then regenerate those captchas.. then just apply the captcha's at random or round-robin to messages.
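
A rough sketch of that fixed-pool idea in Python, purely for illustration: the class CaptchaPool, its parameters, and the random hex string standing in for a real captcha are all made up here and are not part of the module.

```python
import secrets
import time


class CaptchaPool:
    """Fixed-size pool handed out round-robin; stale slots are regenerated."""

    def __init__(self, size, max_age_seconds, now=None):
        now = time.time() if now is None else now
        self.max_age_seconds = max_age_seconds
        self.slots = [self._new(now) for _ in range(size)]
        self.next_index = 0

    @staticmethod
    def _new(now):
        # A random hex string stands in for a generated captcha.
        return {"code": secrets.token_hex(3), "created": now}

    def take(self, now=None):
        # Pick the next slot in round-robin order.
        now = time.time() if now is None else now
        i = self.next_index
        self.next_index = (i + 1) % len(self.slots)
        # Regenerate the slot first if it is older than the time limit.
        if now - self.slots[i]["created"] > self.max_age_seconds:
            self.slots[i] = self._new(now)
        return i, self.slots[i]["code"]
```

The storage stays bounded at `size` entries no matter how many pages are viewed. The trade-off is that the same captcha is shown to several visitors until its slot ages out.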

it would be valuable if it were OPTIONAL to only require one captcha per session.
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
January 31, 2007 11:37PM
I think you're misreading the comment here. The important part:
"If the(y) go to another page without posting a message"
If they do post a message, of course the captcha would be invalidated at once. As long as they don't, there's no need for generating a new unique captcha in the database though. So this would still be a system where 1 captcha could be used for validating exactly 1 action.
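
As a sketch only, the idea could look like this in Python. The names VisitorCaptchas, for_editor and check_post are invented for the example, and a random hex string stands in for the captcha; the module does not currently work this way.

```python
import secrets


class VisitorCaptchas:
    """One live captcha per visitor, reused until a post is attempted."""

    def __init__(self):
        self.by_visitor = {}

    def for_editor(self, visitor_id):
        # Reuse the visitor's existing captcha while they only browse.
        if visitor_id not in self.by_visitor:
            self.by_visitor[visitor_id] = secrets.token_hex(3)
        return self.by_visitor[visitor_id]

    def check_post(self, visitor_id, answer):
        # A post attempt consumes the captcha, whether the answer is right
        # or wrong, so one captcha validates at most one action.
        expected = self.by_visitor.pop(visitor_id, None)
        return expected is not None and expected == answer
```

Browsing ten pages calls for_editor ten times but stores a single record. After check_post, the next editor shown to that visitor gets a fresh captcha.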


Maurice Makaay
Phorum Development Team
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
February 01, 2007 02:22AM
Quote
makaay
Yes, spiders generate captcha's. If somebody has a trusted source for spider IP-addresses
I've sent you a link by PM to a system that detects spiders not by IP address but by referral keywords.
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
February 01, 2007 11:08AM
Like I replied in the PM:
It's possible for spammers to set their user agent field. Therefore the user agent field should never be trusted. Especially if set by a spammer who wants to skip the captcha.

But based on this, I did get an idea for a totally separate module. I think I'll hack up a mod that will check if the user agent field is matching a bot's name. If yes, then it virtually closes down viewed threads, so replying is not available, nor possible. That might be useful in general, to run less code for spiders. I'll have to check the Spam Hurdles mod to see if the hurdles are skipped on closed threads (I think they currently aren't, because the number of closed threads is normally so low that it wouldn't improve much if I'd skip the hurdle checks).
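
Roughly what I have in mind, as a Python sketch rather than actual Phorum code. The marker list and the function names looks_like_bot and page_options are just placeholders for the example.

```python
# Hypothetical sample of substrings that identify well-known spiders.
BOT_MARKERS = ("googlebot", "slurp", "msnbot")


def looks_like_bot(user_agent):
    # Case-insensitive substring match against the marker list.
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def page_options(user_agent):
    # For a claimed bot, treat the thread as closed: no editor, no posting,
    # and therefore no spam hurdle data to generate.
    if looks_like_bot(user_agent):
        return {"show_editor": False, "allow_posting": False, "run_hurdles": False}
    return {"show_editor": True, "allow_posting": True, "run_hurdles": True}
```

The point of closing the thread instead of just skipping the captcha: a spammer who fakes a bot's user agent gains nothing, because posting is refused for that request anyway.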


Maurice Makaay
Phorum Development Team
Re: Spam Hurdles Module (CAPTCHA's and other anti-spam tools)
February 01, 2007 05:51PM
Quote
mmakaay
But based on this, I did get an idea for a totally separate module. I think I'll hack up a mod that will check if the user agent field is matching a bot's name. If yes, then it virtually closes down viewed threads, so replying is not available, nor possible. That might be useful in general, to run less code for spiders. I'll have to check the Spam Hurdles mod to see if the hurdles are skipped on closed threads (I think they currently aren't, because the number of closed threads is normally so low that it wouldn't improve much if I'd skip the hurdle checks).

this is a terrific idea...
on one site I manage, they have these 'tips of the day' which go back a number of years.
for the internal site search engine, the page date is set to the original date of the tip, so people searching within the site can easily see how old and possibly useless the information is, but for spiders it's set to the current date... this makes things look new and fresh for the search engines.

I still think you could re-use even already used captcha's if the goal is to limit the number of captchas in the database on a heavily used phorum...although my suggestion is that another stick of memory would make more sense :P