Avoiding duplicate content

Posted: Tue Nov 06, 2007 5:03 pm
by jhermanns
I just added the following lines to my template so that Google only indexes static pages and entries, but not the overview/archive/search pages. SEO people say this increases the likelihood of all your distinct pages being indexed (so that the overview pages, which duplicate content, don't use up available "slots" for your domain).

Code: Select all

{if $head_title}
    <meta name="robots" content="index,follow">
{else}
    <meta name="robots" content="noindex,follow">
{/if}
But this should really be either part of s9y's core or a plugin - solving it at the template level is kind of clumsy. Is there such a plugin? :)


P.S.
I did not use {if $is_single_entry} because that won't catch static pages.

Re: Avoiding duplicate content

Posted: Tue Nov 06, 2007 5:53 pm
by johncanary
There is nothing like "available slots"!
And there is no "Duplicate Content Penalty" if you run your blog
  * in a natural fashion
  * have original content
  * don't steal from other sites
  * don't overload it with advertisements
The only thing that happens is that Google hides result sets it believes to be redundant for a specific search. Those pages are then "hidden" behind the "More Results" link.

Those category, monthly, archive, and tag pages are just a different mix of the content. It can be those pages that get a higher ranking for a particular search term than a single entry page.
It is very simple:
  * the more pages you allow to be indexed,
  * the more pages will be indexed,
  * the more free traffic you will get.
Google knows how blogs work, and it knows what archives etc. are about.

Don't limit your potential.

You could use the ROBOTS.TXT file to achieve the same effect more easily.
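For reference, the robots.txt equivalent might look like the sketch below. The paths are hypothetical - Serendipity's actual URLs depend on your permalink setup - and note that Disallow blocks crawling entirely, whereas the noindex/follow meta tag still lets crawlers follow the links on those pages:

```
# Hypothetical paths -- adjust to your blog's actual URL structure
User-agent: *
Disallow: /archives/
Disallow: /categories/
Disallow: /search/
```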

JohnCanary

Posted: Tue Nov 06, 2007 6:02 pm
by Don Chambers
I look forward to any further input on this. I have an idea as to where this COULD be included in a plugin but will only do so if the concept has merit.

Re: Avoiding duplicate content

Posted: Tue Nov 06, 2007 6:24 pm
by jhermanns
johncanary wrote:There is nothing like "available slots"!
And there is no "Duplicate Content Penalty" if you run your blog
  * in a natural fashion
  * have original content
  * don't steal from other sites
  * don't overload it with advertisements
The only thing that happens is that Google hides result sets it believes to be redundant for a specific search. Those pages are then "hidden" behind the "More Results" link.
I don't think I mentioned penalties. Anyhow, the noindex on the overview pages still makes sense to me - having each chunk of content indexed only once seems reasonable. So the code snippet above can be interpreted as making sure the right page gets indexed (and later offered as a search result): it ensures that search engines recognize the original source of the information ("the entry", not some overview page), and that archive/list pages don't push the individual entry pages out of the result list.
johncanary wrote:Those category, monthly, archive, and tag pages are just a different mix of the content. It can be those pages that get a higher ranking for a particular search term than a single entry page.
It is very simple:
  * the more pages you allow to be indexed,
  * the more pages will be indexed,
  * the more free traffic you will get.
Google knows how blogs work, and it knows what archives etc. are about.
Yeah, it should - but I don't see any benefit in returning archive pages as search results: all the content therein is also located on a single entry page (which may link to more interesting, related articles). Except when the search query spans two blog entries that are then returned together, but - I don't know.
johncanary wrote: Don't limit your potential.
You could use the ROBOTS.TXT file to achieve the same effect more easily.
JohnCanary
Not with less LOC though - especially when you think about static pages :-)

Posted: Tue Nov 06, 2007 7:50 pm
by Don Chambers
Don Chambers wrote:I look forward to any further input on this. I have an idea as to where this COULD be included in a plugin but will only do so if the concept has merit.
Scratch that. After reviewing the intent and code of the plugin I had in mind, I really do not think it is the place to incorporate this concept since it applies to so many different types of pages (entries, overviews, archives, static pages, and other plugin generated non-entry pages).

Re: Avoiding duplicate content

Posted: Tue Nov 06, 2007 10:02 pm
by johncanary
jhermanns wrote:I don't think I mentioned penalties.
You are right, you didn't.

Having an overview page indexed simply gives you more chances that your blog turns up in search results (SERPs) for some users. That's what I know.

A search engine cares most about giving the best matching results to the user. They do not care so much about original content, especially if it's on the same site. Search engines cluster words and phrases, do statistical analysis, ...

What could be a benefit of having more pages (even overviews, ...) indexed?

It simply increases the probability that a search engine user finds your blog. Instead of setting some pages to 'noindex', I would focus on
  * providing a sitemap, which is very effective
  * getting as many inbound links from as many different, relevant sources as possible. For this it makes much more sense to concentrate on the "entries".
I vote for having these three functions in the S9Y core:
  * Announcement (ping popular services)
  * Trackback control
  * Full pingback support (currently only in the development version)
That's the best for publicity. Publicity brings inbound links. Inbound links are the most effective SEO technique.
jhermanns wrote:
johncanary wrote:You could use the ROBOTS.TXT file to achieve the same effect more easily.
JohnCanary
Not with less LOC though - especially when you think about static pages :-)
LOC ?

Yours
Johncanary

Re: Avoiding duplicate content

Posted: Tue Nov 06, 2007 10:20 pm
by jhermanns
johncanary wrote: Having an overview page indexed gives you simply more chances that your blog
turns up in some search results (SERP) for some users. That's what I know.
Or the overview page could "dominate" for some reason and push the article page out of the search results - and it's the article page that contains the links to related blog entries. That's what I tried to say :-)
johncanary wrote:Instead of setting some pages to 'noindex', I would focus on
  * providing a sitemap, which is very effective
  * getting as many inbound links from as many different, relevant sources as possible. For this it makes much more sense to concentrate on the "entries".
I vote for having these three functions in the S9Y core:
  * Announcement (ping popular services)
  * Trackback control
  * Full pingback support (currently only in the development version)
That's the best for publicity. Publicity brings inbound links. Inbound links are the most effective SEO technique.
How would an accessible site that is not very huge (unlike adobe.com) profit from submitting sitemaps? If you search for "adobe" on Google, for example, you see the effect of submitting a sitemap for a large site. But for regular sites I don't really see the use...
johncanary wrote:
jhermanns wrote:
johncanary wrote:You could use the ROBOTS.TXT file to achieve the same effect more easily.
JohnCanary
Not with less LOC though - especially when you think about static pages :-)
LOC ?
Lines of Code :-)

Re: Avoiding duplicate content

Posted: Wed Nov 07, 2007 12:36 am
by johncanary
jhermanns wrote:How would an accessible site that is not very huge (unlike adobe.com) profit from submitting sitemaps? If you search for "adobe" on Google, for example, you see the effect of submitting a sitemap for a large site. But for regular sites I don't really see the use...
Sitemaps reviewed:

PRO:
  * XML sitemaps inform Google/Yahoo/MSN about updates on your site. Sitemaps are crawled more often than your entire site.
  * When maintained by a content management system like Serendipity, the sitemap is updated automatically without any additional effort, and search engines are pinged actively upon each page update.
  * It gets more pages of your site into the index more quickly.
CONTRA:
  * Sitemaps are used by copyscrapers to more easily find content to steal.
  * You need to set it up, but that doesn't really count.
That's my experience.

You are right that it might be too much hassle for smaller sites, or for sites that don't change very often. In such cases I also use plain-text sitemap files for simplicity. But even those help to get pages indexed more easily.

I don't announce the sitemaps in the ROBOTS.TXT file, and I don't use the default filenames, to make things at least a bit harder for copyscrapers.
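For anyone who wants to hand-roll one, a minimal XML sitemap per the sitemaps.org protocol looks roughly like this (the URL and dates are placeholders, and the /daily/ path is just the example used in this thread):

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> element per page you want crawled -->
  <url>
    <loc>http://example.com/daily/archives/1-some-entry.html</loc>
    <lastmod>2007-11-06</lastmod>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
```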

P.S.: If you want to find Adobe, search for "Click here" (at least on Google.com).
:shock:

Posted: Tue Nov 13, 2007 3:39 am
by Don Chambers
Jannis - how has this modification impacted your search results since you implemented the change?

Posted: Sat Dec 15, 2007 8:32 pm
by jhermanns
I have not yet checked the impact, but I have updated the code:

Code: Select all

{if ($view == "entry" || $view == "plugin") && $smarty.server.REQUEST_URI|truncate:17:"" != "/daily/plugin/tag"}
<meta name="robots" content="index,follow" />
{else}
<meta name="robots" content="noindex,follow" />
{/if}
:-)

Posted: Sat Dec 15, 2007 9:39 pm
by Don Chambers
Keep us posted if you experience any noticeable change in search results. I'm really curious.

Posted: Sun Dec 16, 2007 8:24 am
by carl_galloway
Hey Jannis,

Got a quick question: why are you truncating to 17 characters? Looking at your domain, 14 seems like a better number (i.e. http://jann.is) - care to explain more?

Carl

Posted: Sun Dec 16, 2007 1:29 pm
by jhermanns
hey carl,

the $smarty.server.REQUEST_URI variable does not contain the protocol and hostname; what I am truncating is a string like /daily/plugin/tag/sometag.

And because I only want the single entries and the static pages to be indexed, I had to add this check: the $view == "plugin" check is true not only for pages generated by the static-page plugin, but also for pages created by the freetag plugin.

So I truncate the REQUEST_URI to its first 17 characters, the length of "/daily/plugin/tag". That adds "noindex" to any URL that belongs to the freetag plugin.
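In other words, the template condition boils down to a string-prefix test on the request URI. Here is the same logic as a plain-Python sketch (the function name is made up for illustration; view and request_uri stand in for Smarty's $view and $smarty.server.REQUEST_URI):

```python
def should_index(view: str, request_uri: str) -> bool:
    """Index only entry pages and plugin pages that are not tag pages."""
    is_entry_or_plugin = view in ("entry", "plugin")
    # truncate:17:"" keeps the first 17 characters -- the length of
    # "/daily/plugin/tag" -- so the Smarty check is a prefix comparison.
    is_tag_page = request_uri[:17] == "/daily/plugin/tag"
    return is_entry_or_plugin and not is_tag_page

print(should_index("entry", "/daily/archives/1-foo.html"))   # True
print(should_index("plugin", "/daily/plugin/tag/sometag"))   # False
```

Static pages pass through because their $view is "plugin" but their URI does not start with the tag prefix.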

Posted: Sun Dec 16, 2007 2:09 pm
by carl_galloway
Awesome, thanks for the info :D

Posted: Tue Aug 19, 2008 5:19 pm
by Michele2
This is exactly what I was looking for - a way to keep Google from indexing search-result pages. On my blog, which is about crafts and marketing them, I see no reason for Google to index search pages for words like "wisely", "remiss", "teensy", "sacrificed" and other equally useless terms. It seems ridiculous at best, spammy at worst.

Right now I think I have more search pages indexed than entries. :cry:

One major drawback that I see is that the code also marks the homepage as noindex,follow. I changed the code a bit so that the homepage gets index,follow...

Code: Select all

{if $head_title}
    <meta name="robots" content="index,follow">
{elseif $startpage}
    <meta name="robots" content="index,follow">
{else}
    <meta name="robots" content="noindex,follow">
{/if}