mod_rewrite and 404

DB
Regular
Posts: 22
Joined: Mon May 01, 2006 9:40 pm

Post by DB »

Why is it that when mod_rewrite is enabled, you end up losing your 404 on stuff like:
blog/1-Category/this/directory/does/not/exist.html

It simply loads up the '1-Category' page like all of those '/' don't even exist. I mean, they don't, but it isn't a category, or a post, or even a page. So can't this just take us to a 404 page?

In other words, how can this rule be made to respect freakishly weird category names, yet still know when to take us to a 404 page:
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+) unite.html?/$1 [NC,L,QSA]

Can anyone think of how to do this? What would it take to make it recognize a second instance of '/'?
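
To see why the extra segments never stop the rule from firing, here is a small sketch that runs the pattern from the RewriteRule above through Python's re module. Python is only a stand-in for mod_rewrite's regex engine here, re.IGNORECASE approximates the [NC] flag, and the request paths and the helper name captured_prefix are made up for illustration.

```python
import re

# Pattern copied from the category RewriteRule; note there is no "$" at the end.
CATEGORY_RULE = re.compile(
    r"^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+)",
    re.IGNORECASE,  # stand-in for the [NC] flag
)

def captured_prefix(path):
    """Return what $1 would hold, or None if the rule would not fire."""
    match = CATEGORY_RULE.search(path)
    return match.group(1) if match else None

# Hypothetical request paths, written without a leading slash.
print(captured_prefix("categories/1-Category"))
print(captured_prefix("categories/1-Category/this/directory/does/not/exist.html"))
# Both lines print: categories/1-Category
```

Because the pattern is anchored only at the start, the first "/" after the category name simply ends the match, and whatever follows is never examined by the rule.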
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Post by garvinhicking »

Hi!

Yes, this is a fallback because s9y detects the Category URL and sets the parameters to that category. This is a feature, actually - you can't really disable that, it's wired deep in the code.

Regards,
Garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
Don Chambers
Regular
Posts: 3652
Joined: Mon Feb 13, 2006 2:40 am
Location: Chicago, IL, USA

Post by Don Chambers »

You could do a 404 of your own like this inside index.tpl of your template:

{if $view=='404'}
    YOUR CODE HERE
{else}
    {$CONTENT}
{/if}
=Don=

Post by garvinhicking »

Hi Don!

Actually, I don't think this applies here. A 404 only happens when no URL rule is matched (like /blalbla/sdfsdf.html). Here, the /categories/x---/ does match the usual category style, so s9y thinks this is perfectly valid and jumps to the "category view" branch in index.php. The 404 would only be reached in instances where no URL check matches.

Regards,
Garvin

Post by Don Chambers »

Good point Garvin - I should have read the original post a bit more carefully.

Post by DB »

garvinhicking wrote: Yes, this is a fallback because s9y detects the Category URL and sets the parameters to that category. This is a feature, actually - you can't really disable that, it's wired deep in the code.
It would be nice (more of a feature) if the requested URL were cleansed, and maybe checked against the actual categories that exist. If this/is/some/crazy/url.html existed, then it would take you to that category page. If it doesn't match up, then you go to a "Page Not Found".

So:

../blog/1-Category would take us to a real page

../blog/1-Category/any/old/random/endless/useless/thing would in fact take us to a 404 page

Maybe I am the only one who is not enjoying the benefits of this feature.
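
As a rough sketch of the check described above (not how s9y actually works), the following Python function accepts a category path only when the ID and name are known and nothing else follows. The category table, the function name status_for, and the "categories/" prefix taken from the rewrite rule are all assumptions for illustration. It deliberately treats every extra path segment as invalid, which a real installation that passes parameters in the URL may not be able to afford.

```python
import re

# Hypothetical category table; a real blog would load this from its database.
KNOWN_CATEGORIES = {1: "Category", 2: "News"}

# "<id>-<name>", an optional trailing slash, and nothing after that.
STRICT_CATEGORY = re.compile(r"^categories/([0-9]+)-([^/]+)/?$")

def status_for(path):
    """Return 200 for an exact, known category path and 404 otherwise."""
    match = STRICT_CATEGORY.match(path)
    if match is None:
        return 404  # wrong shape, or extra segments after the category
    category_id, name = int(match.group(1)), match.group(2)
    if KNOWN_CATEGORIES.get(category_id) != name:
        return 404  # well-formed, but no such category
    return 200

print(status_for("categories/1-Category"))
print(status_for("categories/1-Category/any/old/random/endless/useless/thing"))
```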

Post by garvinhicking »

Hi!

DB wrote: It would be nice (more of a feature) if the requested URL was cleansed, and maybe checked against the actual categories that exist. If this/is/some/crazy/url.html existed, then it would take you to that category page. If it doesn't match up, then you go to a "Page not Found".
That's really hard, because the Categories and other s9y links can take all sorts of parameters (Authorid, Timerange, Freetags, etc.) and plugins can have an effect on them. I wouldn't really know a way to "scrub" the URL and validate it up front.

In which regard is this behavior a problem?

Regards,
Garvin

Post by DB »

garvinhicking wrote: In which regard is this behavior a problem?
In the regard that the 404 page isn't doing its job anymore.

Totally ready to let this go, just curious is all.

Rant ensuing...

It has been brought up on more than one occasion in my experience when using rewrites.

One of my s9y setups did have a bout with it here (http://board.s9y.org/viewtopic.php?f=2&t=14063), and changing all my links from relative to absolute, as you had pointed out, seemed to take care of the problem I was having (spiders were hitting links such as /1-Category/login/css/print/terms/login/signup/login/support.html, and these URLs would go on for days, gigabytes' worth). It might still occur, though, since I do still have relative links, many of them created via the admin panel WYSIWYG editor. *Sorry if this is a double post. The previous issue seems resolved, and this was more of a general question.

I assist in administering several non-s9y websites that produce all sections/pages of the sites dynamically using rewrites. I have been asked more than once why a 404 is not produced when a non-existent directory is hit. It might be that someone just reworked their entire website, so there are plenty of URLs out there that may now point to a non-existent place. I find that most of the people I work with, who pay lots of $ to SEO companies, request that any 'now non-existent' page must take you to a 404. The Page Not Found can have some "we just redesigned our site" type of language, etc. There are usually a few regex patterns that can be added to the rewrites to attempt to deal with any foreseen circumstances.

In terms of the s9y setup, I just find it odd that this open-ended URL stuff seems somehow okay. There is no check done on these types of URLs. Not even to match it up to something under 100 chars, or even under 1000000 chars. Anything goes. The slashes in the URLs refer to a directory structure... until you get to a category page, then all of that doesn't matter anymore. The open-ended, maybe endless, URL is a feature.

The entries rewrite, although not checked to see if the entry actually exists, at least does not allow for non-existent directories:

#no endless directory here
RewriteRule ^((archives/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+\.html)/?) unite.html?/$1 [NC,L,QSA]

The categories rule definitely doesn't check for any slashes:

#endless directory here though
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+) unite.html?/$1 [NC,L,QSA]

Something like this might work:

# could stop with something like this, may not be perfect for all instances
# first rule: the category page itself, with an optional trailing slash
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+)/?$ unite.html?/$1 [NC,L,QSA]
# second rule: one extra segment such as P2.html (dot escaped so it only matches a literal ".")
RewriteRule ^(categories/([0-9]+)-[0-9a-z\.\_!;,\+\-\%]+)/([P0-9]+)\.html$ unite.html?/$1 [NC,L,QSA]

By the way, while not being cleansed for length, are they at least cleansed for javascript, etc? I'm assuming they are, but haven't actually looked at the code.

Post by garvinhicking »

Hi!

Thanks for your in-depth explanation. I now have a better understanding of why this could cause trouble for you. However, the s9y rewrite rules must allow nearly everything inside the URL so that any plugin can "grab" parts of the URL and add to them. Thus, we need to allow all alphanumeric characters inside the URL, and once any part of that matches a global s9y permalink, the page is deemed valid. I really wouldn't know an easy way to cut this that wouldn't affect the functionality of a lot of s9y plugins and customizations, I'm sorry.

This is one part where flexibility to deal with future possibilities interferes with the wish to have very strict URL parsing. You can't really have both, and in my experience as a blog reader it's better for me to see content for the parts of the URL that apply to what I want (i.e. a specific category) than to get a generic 404 page.

So your suggested regexp change would no longer allow viewing entries from 2008 in category 1: /categories/1-mycategory/2008, or even viewing page 2 of all postings by the first author in the 1st category: /category/1-mycategory/A1/P2.html
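
This can be checked with a quick sketch that runs the two rules suggested earlier in the thread through Python's re module. Python is only a stand-in for mod_rewrite here, re.IGNORECASE approximates [NC], the dot before "html" is written escaped, the helper name accepted is made up, and the second example path is assumed to start with "categories/" like the first.

```python
import re

# Character class shared by both suggested rules.
NAME = r"[0-9a-z\.\_!;,\+\-\%]+"

# The two anchored patterns from the suggested rewrite rules.
SUGGESTED_RULES = [
    re.compile(r"^(categories/([0-9]+)-" + NAME + r")/?$", re.IGNORECASE),
    re.compile(r"^(categories/([0-9]+)-" + NAME + r")/([P0-9]+)\.html$", re.IGNORECASE),
]

def accepted(path):
    """True if at least one of the suggested rules would match the path."""
    return any(rule.search(path) for rule in SUGGESTED_RULES)

for path in (
    "categories/1-mycategory",             # plain category page
    "categories/1-mycategory/P2.html",     # page 2 of the category
    "categories/1-mycategory/2008",        # entries from 2008
    "categories/1-mycategory/A1/P2.html",  # author 1, page 2
):
    print(path, accepted(path))
```

Under these assumptions the first two paths are accepted and the last two are not, so the stricter rules would indeed turn the year and author views into non-matches.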

SEO-wise, wouldn't having content on a page be better than not having it?
DB wrote: By the way, while not being cleansed for length, are they at least cleansed for javascript, etc? I'm assuming they are, but haven't actually looked at the code.
Yes, they are. :-) (At least I know of no way to inject this, as we use htmlspecialchars() on the output everywhere.)
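
As a rough illustration of what escaping on output does, here is a minimal Python sketch. It uses html.escape from Python's standard library, which is a comparable function and not PHP's htmlspecialchars() itself; the helper name render_path and the sample path are made up.

```python
import html

def render_path(path):
    """Escape a request path before echoing it into an HTML page."""
    return "<p>You requested: " + html.escape(path, quote=True) + "</p>"

# The markup in the path is neutralised: "<" and ">" become "&lt;" and "&gt;".
print(render_path("categories/1-x/<script>alert(1)</script>"))
```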

Regards,
Garvin

Post by DB »

garvinhicking wrote: SEO-wise, wouldn't having content on a page be better than not having it?
From my research, and from what those particular clients tell me, no content IS better than duplicate content. I'm not positive about that, though.

Either way, it sounds like disabling mod_rewrite on my s9y blog would be the best option, if it came down to it.

Thanks
-DB