
Hack for better spidering.

Posted by brianlmoon 
Hack for better spidering.
June 26, 2003 12:47AM
This hack will remove the ? and & from the Phorum urls so that in theory spiders will bring your server to its knees as they traverse through your Phorum. Sorry, I meant spider your Phorum urls. ;)

This is a very rare hack from me. I don't really do hacks. Your server will need to support PATH_INFO, which basically means Apache. Enjoy.

SEE MY POST LATER IN THIS THREAD FOR A NEW VERSION OF THIS HACK.
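For anyone curious how the trick works in general, here is a sketch of the technique (not the attached patch itself; the parameter names are illustrative): links are emitted as read.php/f/1/i/23, and the script maps the PATH_INFO segments back into $_GET before the rest of the code runs.

```php
// Sketch only: map URL path segments back into $_GET.
// A link such as  read.php/f/1/i/23  arrives with
// $_SERVER['PATH_INFO'] == "/f/1/i/23" under Apache.
if (!empty($_SERVER['PATH_INFO'])) {
    $parts = explode('/', trim($_SERVER['PATH_INFO'], '/'));
    // Consume the segments pairwise: key, value, key, value, ...
    for ($x = 0; $x + 1 < count($parts); $x += 2) {
        $_GET[$parts[$x]] = $parts[$x + 1];
    }
}
```

After this runs, the rest of the script can read $_GET['f'] and $_GET['i'] exactly as if a normal query string had been used.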

mike567
Re: Hack for better spidering.
July 29, 2003 01:51PM
Undoubtedly, such a modification would be a great advantage. Thus, I immediately installed the modification. However, after this modification it was impossible to reply to messages. I did not figure out where the problem was, but I had to undo the modification.
Thus, it is a great attempt, but at least in my case, it did not work.
Re: Hack for better spidering.
August 08, 2003 12:02PM
This mod breaks everything on 3.4.4 - when replying to a message, "preview" is broken, and "post" doesn't work at all. In addition, extra chars are added to the end of every page, including mangled footers.
Re: Hack for better spidering.
August 08, 2003 04:29PM
Ok, I really tested this one better. It only changes the read.php, list.php and index.php urls, as those are all you need for spidering. It also changes any relative urls in the HTML to absolute urls, since browsers see a / in the path and make assumptions.
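The relative-URL fix matters because with a path like read.php/f/1/i/23 the browser resolves relative links against a directory that doesn't exist. A rough sketch of making a link absolute (the function and variable names here are illustrative, not taken from the attached hack):

```php
// Sketch: prefix relative links with the real forum base URL, so the
// browser does not resolve them against the bogus /f/1/... "directory".
function absolutize($href, $base_url) {
    // Leave already-absolute links (http://..., /rooted paths) alone.
    if (preg_match('!^(https?://|/)!', $href)) {
        return $href;
    }
    return rtrim($base_url, '/') . '/' . $href;
}
// e.g. absolutize("list.php?f=1", "http://example.com/phorum")
//      -> "http://example.com/phorum/list.php?f=1"
```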

Attachments:
urlshack-2.txt (1.2 KB)
Re: Hack for better spidering.
August 08, 2003 07:22PM
Nice! So far, no problems that I can find - I'll keep beating up on this one and let you know if anything breaks.

This really should be included in the next release of Phorum, being able to have your forums spidered and indexed is important...

Greg
Anonymous User
Re: Hack for better spidering.
August 09, 2003 04:31AM
No, it's not important and will not be included.
As Brian already said, a spider going through all your forums and posts can bring your server to its knees in a second.
Everyone who really needs it can get it as the hack here.

ts77 wrote:
> No, it's not important and will not be included.
> As Brian already said, a spider going through all your forums
> and posts can bring your server to its knees in a second.

Hi, I'm new here, be gentle with me.... ;-)

I'm not a Phorum user yet, but I soon will be. I truly hope that the standard distribution will become search engine friendly, since I'm not feeling skilled enough to patch the distribution. :-(

Thomas makes a good point. I hope that I can offer a good compromise.

Spiders can be told not to crawl a URL structure. The simplest way would seem to be a <META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"> tag in each page. It should also be noted that an admin can always turn off spidering for all or parts of his site, with a robots.txt file in the root of the site.
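For reference, a robots.txt in the site root that keeps well-behaved spiders out of a forum installed under /phorum/ might look like this (the path is only an example):

```
User-agent: *
Disallow: /phorum/
```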

If Phorum gets a configuration option to allow spidering, and the installation default is to disallow, then everyone should be happy...

I'd really like to turn on spidering, for two reasons:
- It allows people to find the information they need. In my case, I'll use Phorum for a tech support forum. If people cannot find forum posts via Google et cetera, then some people could have problems that remain unsolved, even though a published fix exists. This obviously doesn't serve the greater good.
- It can draw visitors to a site, by getting good content spidered by the search engines which leads to more search hits.

BTW, more spiders now follow 'machine-generated' URLs, so resistance may be futile, or protection against spiders may be a useful feature for some...

The optimal solution IMHO looks like so:
- a configuration option to allow/disallow spidering. Default value is disallow.
- search engine friendly URLs by default. (Is there a problem here with changing URLs for existing forums?)
- a noindex, nofollow meta tag when set to disallow.
- conversely, an index, follow meta tag when set to allow.
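A minimal sketch of such an option, assuming a hypothetical $allow_spidering setting in the forum config (Phorum has no such variable today; the name is made up for illustration):

```php
// Hypothetical config switch; the suggested installation default is disallow.
$allow_spidering = false;

// Emitted inside the <head> of every page.
if ($allow_spidering) {
    echo '<meta name="robots" content="index,follow">';
} else {
    echo '<meta name="robots" content="noindex,nofollow">';
}
```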

I know, "Fix it yourself" as Linus Torvalds is fond of saying. As noted, that's just not always so simple for a non-programmer. :-)

There is a bit more info on allowing/disallowing spiders here:
[www.searchengineworld.com]
[www.searchengineworld.com]

Thanks for your consideration,

Jesper Mortensen
Anonymous User
Re: Hack for better spidering.
October 06, 2003 04:36PM
> I'd really like to turn on spidering, for two reasons:
> - It allows people to find the information they need. In my case, I'll use Phorum for a tech support forum. If people cannot find forum posts via Google et cetera, then some people could have problems that remain unsolved, even though a published fix exists. This obviously doesn't serve the greater good.

That's what the search facility in Phorum itself is for.

> - search engine friendly URLs by default.

No, as that method doesn't work in all supported environments afaik.

ts77 wrote:
> thats what the search-facility in phorum itself is for.

If/when you know of the forum's existence.

> > - search engine friendly URLs by default.
> No, as that method doesn't work in all supported environments
> afaik.

Meta tags may not work 100%, but robots.txt should.

It's your software, so it's your decision. But please do think it over again, and note one strong vote for optionally allowing spidering.

/Jesper Mortensen
Dear Thomas,

you're right when saying "... as that method doesn't work in all supported environments ..."; this is absolutely true, and imho the function shouldn't become the default for now.
But what I would wish / love to have is a preparation in Phorum like the "file-ending" setting (.php|.html|etc).
It might be done with an additional line of configuration on the same configuration page (where the file endings get specified).
It might look like this:
$config_string = "?f={forum_id}&i={message_id}&t={thread_id}&v={thread_view}"; // actual "default" config

function calculate_url($f, $m, $t, $v) { // this generates the URL
  global $config_string, $phorum_hostname, $phorum_dir, $phorum_filename;
  $url = $config_string;
  $url = str_replace("{forum_id}", $f, $url);
  $url = str_replace("{message_id}", $m, $url);
  $url = str_replace("{thread_id}", $t, $url);
  $url = str_replace("{thread_view}", $v, $url);

  // prepend hostname, directory and filename ...
  $url = $phorum_hostname . $phorum_dir . $phorum_filename . $url;
  return $url;
}
With $config_string = "/f/{forum_id}/i/{message_id}/t/{thread_id}/v/{thread_view}.html" a URL could easily get rewritten ;)
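To illustrate the idea as a self-contained sketch (the template placeholders follow the post above; the values and function name are made up):

```php
// Two interchangeable URL templates, swapped by configuration alone.
$query_style = "?f={forum_id}&i={message_id}&t={thread_id}&v={thread_view}";
$path_style  = "/f/{forum_id}/i/{message_id}/t/{thread_id}/v/{thread_view}.html";

function fill_template($template, $f, $m, $t, $v) {
    // str_replace accepts parallel arrays of search/replace values.
    return str_replace(
        array('{forum_id}', '{message_id}', '{thread_id}', '{thread_view}'),
        array($f, $m, $t, $v),
        $template
    );
}

echo fill_template($path_style, 1, 23, 20, "collapsed");
// -> /f/1/i/23/t/20/v/collapsed.html
```

Switching the whole forum between query-string and path-style URLs then means changing one configuration line, not touching the code.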

The scary things about "dragging down performance" of a Phorum are mostly nonsense! Nearly all (commercial) search engines (e.g. Google) take real care not to do this (using delays between page requests; even htdig has such a feature).

I would also strongly recommend mod_rewrite for doing the address translation on the server side (if using Apache), as it's quite a bit faster than processing $_SERVER['PATH_INFO'] in PHP ...
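A sketch of what such a rewrite rule could look like in an Apache .htaccess (the URL layout is only an example and must match whatever $config_string actually produces):

```
RewriteEngine On
# Map /f/1/i/23/t/20/v/collapsed.html back onto read.php's query string.
RewriteRule ^f/([0-9]+)/i/([0-9]+)/t/([0-9]+)/v/([a-z]+)\.html$ read.php?f=$1&i=$2&t=$3&v=$4 [L]
```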


But there might be some more aggressive bots out there, which become a case for a robots.txt exclusion if they don't behave :-o

Greetings from Germany,
Michael

Re: Hack for better spidering.
November 24, 2003 06:09PM
Phorum 5 builds all urls in a function. You should be able to easily do as you please with urls in Phorum 5.

ChrisB
Re: Hack for better spidering.
January 11, 2004 08:38AM
I installed the hack from brianlmoon, but now the cookie function doesn't work anymore.
What settings do I need in .htaccess?
Klaus Weber
Re: Hack for better spidering.
January 13, 2004 07:40PM
I have the same problem. Any ideas?
Re: Hack for better spidering.
January 14, 2004 01:35AM
Yeah, I had that problem too. You will need to find all the setcookie calls and set their path parameter to your phorum directory.
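In PHP's setcookie() that is the fourth ("path") argument. With PATH_INFO-style URLs the browser would otherwise scope the cookie to the fake subdirectory. A sketch, assuming the forum lives under /phorum/ (the cookie name and value are only illustrative):

```php
// Without an explicit path, a cookie set from /phorum/read.php/f/1/i/23
// is scoped to /phorum/read.php/f/1/ and never sent back to list.php.
// Pinning the path to the real forum directory fixes that:
$value = "example"; // illustrative cookie value
setcookie("phorum_user", $value, time() + 86400 * 30, "/phorum/");
```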

SGi
Re: Hack for better spidering.
February 12, 2004 10:56AM
Hi!
Nice hack, very nice. Very useful.

Did the "find all the setcookie calls and set their path parameter to your phorum directory" solution solve the cookie problem for everyone?
I mean, using this mod and setting the directory right, does Phorum become 100% search engine crawlable and still remain 100% functional?
And does it work fine with release 3.4.6?

BTW, I do agree with Jesper Mortensen above.

Thanks!

Re: Hack for better spidering.
February 16, 2004 03:54PM
Yes, the forum will be functional and more spider friendly.

FWIW, Google does not ignore pages with ? in them, and most other new search engines don't either. They do, however, put more time between requests for the same script on a site (the stuff before the ?) than for distinct pages, to keep from slamming your server.

Re: Hack for better spidering.
March 16, 2004 02:33PM
Is there a way to use this hack in 3.4.6?
Re: Hack for better spidering.
March 30, 2004 04:16AM
Yes, you're totally right!
Re: Hack for better spidering.
April 05, 2004 11:46AM
Now Google is able to catch a URL like [phorum.org] (in the past, it was unable to catch URLs like [phorum.org]).

Try in Google the request:
"SEE MY POST LATER IN THIS THREAD FOR A NEW VERSION OF THIS HACK."
You will get : [phorum.org]

Now, Google is able to go through all your forums.

Thomas Seifert wrote:
"a spider going through all your forums and posts can bring your server to its knees in a second". Is that really the case? Has anyone had a problem with this?
Re: Hack for better spidering.
May 15, 2004 10:01PM
Hi,

Your hack worked fine on my Phorum 3.4.4.

There is only one minor bug: the next post and previous post links at the top right of a post don't work now. Please help me if you can find some time.