
Meta Tags, Meta Robots, and Robots.txt

 

What is a Meta Tag?

Meta elements are HTML or XHTML elements used to provide structured metadata about a Web page. They must be placed as tags inside the page's head element.

A meta tag has two uses: to emulate an HTTP response header, or to embed additional metadata within the HTML document.

Examples of meta tag use


<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="keywords" content="Your Website Keywords" />
<meta name="description" content="Your Website Description" />


How are meta tags used in search engine optimization?

Meta tags provide information about a given Web page, most often to help search engine spiders categorize it correctly. They are embedded in the HTML document, but are usually not directly visible to a user visiting the site.

They have long been a focus of search engine optimization, where different methods are explored to earn a site a higher position in search engines' organic rankings. In the mid-to-late 1990s, search engine bots relied on meta tags to correctly classify a Web page, and webmasters quickly learned the commercial significance of having the right meta elements, as they frequently led to a high ranking in the search engines and, therefore, high traffic to the website.

As organic search traffic took on greater significance in Internet marketing plans, consultants well versed in search engine behavior were brought in for SEO. These consultants used a variety of techniques to improve rankings for their clients' websites.

While search engine optimization can improve organic rankings, consumers of such services should be careful to employ only reputable providers. Given the extraordinary competition and craftsmanship required for top organic placement, the implication of the term "search engine optimization" has become progressively worse over the last decade. Where it once meant bringing a website to the top of a search engine's organic results page, for some consumers it now implies keyword spamming, or optimizing a site's internal search engine for improved performance.

Major search engine crawlers are more likely to weigh factors such as the volume of incoming links from related websites, quantity and quality of content, technical precision of the source code, spelling, functional versus broken hyperlinks, volume and consistency of searches and/or visitor traffic, time on site, page views, revisits, click-throughs, technical user features, uniqueness, redundancy, relevance, advertising revenue yield, freshness, geography, language, and other intrinsic characteristics.

Useful Meta Tags for SEO

  • author
    Who wrote this Web page? You can include a list of authors if multiple people wrote the content; it typically refers to the content authors rather than the designers of the HTML or CSS.
    <meta name="author" content="author name" />
     
  • copyright
    Set the copyright date on the document. Note, you shouldn't use this instead of a copyright notice that is visible on the Web page, but it's a good place to store the copyright in the code as well.
    <meta name="copyright" content="© 2008 Jennifer Kyrnin and About.com" />
     
  • contact
    This is a contact email address for the author of the page (generally). Be aware that if you put an email address in this tag, it can be read by spammers, so be sure to protect your email address.
    <meta name="contact" content="email address" />
     
  • last-modified
    When was this document last edited?
    <meta http-equiv="last-modified" content="YYYY-MM-DD@hh:mm:ss TMZ" />

Meta Tags for Communicating with the Web Browser or Server

These meta tags provide information to the Web server and any Web browsers that visit the page. In many cases, the browsers and servers can take action based on these meta tags.
  • cache-control
    Control how your pages are cached. The options you have are: public (default) - allows the page to be cached; private - the page may only be cached in private caches; no-cache - the page should never be cached; no-store - the page may be cached but not archived.
    <meta http-equiv="cache-control" content="no-cache" />
     
  • content-language
    Define the natural language(s) used on the Web page. Use the ISO 639-1 language codes. Separate multiple languages with commas.
    <meta http-equiv="content-language" content="en,fr" />
     
  • content-type
    This meta tag defines the character set that is used on this Web page. Unless you know that you're using a different charset, I recommend you set your Web pages to use UTF-8.
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
     
  • expires
    If the content of your page has an expiration date, you can specify this in your meta data. This is most often used by servers and browsers that cache content. If the content is expired, they will load the page from the server rather than the cache. To force this, you should set the value to "0", otherwise use the format YYYY-MM-DD@hh:mm:ss TMZ.
    <meta http-equiv="expires" content="0" />
     
  • pragma
    The pragma meta tag is the other cache control tag to use if you don't want your Web page cached. You should use both meta tags to prevent your Web page from being cached.
    <meta http-equiv="pragma" content="no-cache" />

Control Robots with Meta Tags

There are two meta tags that can help you control how Web robots access your Web page.
  • robots
    This tag tells the Web robots whether they are allowed to index and archive this Web page. You can include any or all of the following keywords (separated by commas) to control what the robots do: all (default) - the robots can do anything on the page; none - robots can do nothing; index - robots should include this page in the index; noindex - robots should not include this page in the index; follow - robots should follow the links on this page; nofollow - robots should not follow links on this page; noarchive - Google uses this to prevent the page from being archived.
    <meta name="robots" content="noindex,nofollow" />
     
  • googlebot
    Google has its own robot, Googlebot, and prefers that you use the googlebot meta tag to control it. You can use the following keywords to control Googlebot: noarchive - Google will not display cached content; nosnippet - Google will not display excerpts or cached content; noindex - Google will not index the page; nofollow - Google will not follow the links on the page.
    <meta name="googlebot" content="nosnippet,nofollow" />
     

What is Robots.txt?

Web site owners use the robots.txt file to give instructions about their site to web robots (crawlers/bots); this is called the Robots Exclusion Protocol (REP).

It works like this: a robot wants to visit a Web site URL, say http://www.example.com/welcome.html. Before it does so, it first checks for http://www.example.com/robots.txt and follows the rules it finds there.
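
To see how a well-behaved crawler can apply these rules in practice, here is a minimal sketch using Python's standard-library urllib.robotparser. The crawler name "MyCrawler" is just a placeholder, and the URLs are the example addresses above.

import urllib.robotparser

# Fetch and parse the site's robots.txt before requesting any page.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

url = "http://www.example.com/welcome.html"
if rp.can_fetch("MyCrawler", url):  # "MyCrawler" is a hypothetical user-agent
    print("robots.txt allows crawling", url)
else:
    print("robots.txt disallows crawling", url)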

Handy Robots.txt Cheat Sheet


This example tells all robots that they can visit all files, because the wildcard * applies to all robots and the empty Disallow value means nothing is disallowed:

User-agent: *
Disallow:

This example tells all robots to stay out of a website:

User-agent: *
Disallow: /

The next example tells all robots not to enter four directories of a website:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/

Example that tells a specific robot not to enter one specific directory:

User-agent: BadBot # replace 'BadBot' with the actual user-agent of the bot
Disallow: /private/

Example that tells all robots not to enter one specific file:

User-agent: *
Disallow: /directory/file.html

Note that all other files in the specified directory will be processed.

Example demonstrating how comments can be used:

# Comments appear after the "#" symbol at the start of a line, or after a directive
User-agent: *  # match all bots
Disallow: /    # keep them out

Example demonstrating how to tell bots where the Sitemap is located:

User-agent: *
Sitemap: http://www.example.com/sitemap.xml  # tell the bots where your sitemap is located

Nonstandard extensions
Crawl-delay directive

Several major crawlers support a Crawl-delay parameter, set to the number of seconds to wait between successive requests to the same server:[4][5][6]

User-agent: *
Crawl-delay: 10
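
If you write your own crawler in Python, the standard-library urllib.robotparser can read this value for you (Python 3.6 or later). This is just a sketch that assumes the example robots.txt above is served from a placeholder host, and "MyCrawler" is a hypothetical user-agent.

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder host
rp.read()

delay = rp.crawl_delay("MyCrawler")  # 10 for the example above; None if no Crawl-delay applies
if delay:
    time.sleep(delay)  # wait before the next request to the same server
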
Allow directive

Some major crawlers support an Allow directive, which can counteract a following Disallow directive. This is useful when one tells robots to avoid an entire directory but still wants some HTML documents in that directory crawled and indexed. While in the standard implementation the first matching robots.txt pattern always wins, Google's implementation differs in that Allow patterns with an equal or greater number of characters in the directive path win over a matching Disallow pattern. Bing uses whichever directive, Allow or Disallow, is the most specific.

In order to be compatible with all robots, if one wants to allow single files inside an otherwise disallowed directory, it is necessary to place the Allow directive(s) first, followed by the Disallow, for example:

Allow: /folder1/myfile.html
Disallow: /folder1/

This example will disallow anything in /folder1/ except /folder1/myfile.html, since the latter will match first. In the case of Google, though, the order is not important.
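
You can check this first-match-wins behavior with a small script. The sketch below uses Python's urllib.robotparser, which also applies rules in order, and parses the example rules above directly from memory; "MyCrawler" is just a placeholder user-agent.

import urllib.robotparser

rules = [
    "User-agent: *",
    "Allow: /folder1/myfile.html",
    "Disallow: /folder1/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)  # parse in-memory rules instead of fetching a URL

print(rp.can_fetch("MyCrawler", "/folder1/myfile.html"))  # True: the Allow line matches first
print(rp.can_fetch("MyCrawler", "/folder1/other.html"))   # False: falls through to the Disallow line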


Sitemap
Some crawlers support a Sitemap directive, allowing multiple Sitemaps in the same robots.txt in the form:

Sitemap: http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml
Sitemap: http://www.google.com/hostednews/sitemap_index.xml
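
A Python crawler can read these entries as well: RobotFileParser.site_maps() (available since Python 3.8) returns the listed Sitemap URLs, or None if the file declares none. The host below is a placeholder.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")  # placeholder host
rp.read()

sitemaps = rp.site_maps()  # list of Sitemap URLs, or None
if sitemaps:
    for sitemap_url in sitemaps:
        print("Sitemap:", sitemap_url)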


Universal "*" match
The Robot Exclusion Standard does not mention the "*" character in the Disallow: statement. Some crawlers, like Googlebot and Slurp, recognize strings containing "*", while MSNbot and Teoma interpret it in different ways.