How Do Search Engine Robots Work?
September 12, 2007
There is a discussion on WebmasterWorld about how search engine robots work. Pageoneresults posted some frequently asked questions:
1. Do robots accept cookies?
2. What happens if my site forces a cookie?
3. Do robots execute JavaScript functions?
4. Could I be doing something technically that is stopping a robot from indexing my site?
5. How do robots interpret my page?
6. In what order do robots index my page? What is the very first step that a robot takes?
Before answering those questions, I would like to define what “search engine robots” are. Many newbie webmasters, or those who want to become webmasters, don’t know what a search engine spider is. A search engine spider is also called a web crawler.
Definition from Wikipedia:
A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
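To make the seeds and crawl frontier idea concrete, here is a minimal sketch in Python. It is only an illustration, not how any real search engine is written: the tiny `FAKE_WEB` dictionary and the `crawl` function are made up for this example, and the "pages" live in memory instead of being downloaded.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so nothing is visited twice
    visited = []              # order in which pages were actually visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["http://example.com/"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

The `max_pages` limit stands in for the "set of policies" in the definition; a real crawler would also apply politeness delays and robots.txt rules.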
Let me try my best to answer these questions:
1. Do robots accept cookies?
Normally a search engine spider will not accept cookies. Most web crawlers aim to collect and read freely available content. Take a forum as an example. If your forum has sections that are readable only after the user logs in (logging in sets a cookie in the user’s browser), the spider will not read those sections.
2. What happens if my site forces a cookie?
If a page requires a cookie (as in the forum example), the spider will see none of its content. As mentioned before, a spider reads only what it is allowed to read. Think of a spider as a guest visitor on your site or forum.
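The forum case can be simulated with a few lines of Python. This is a rough sketch under my own assumptions: `serve_members_page` is a hypothetical page handler and the cookie name `session` is invented for the example, so real forum software will look different.

```python
def serve_members_page(cookies):
    """Hypothetical forum handler: members-only content needs a session cookie."""
    if cookies.get("session") == "valid":
        return "<h1>Members area</h1><p>Private thread list</p>"
    # No cookie (or a bad one): the visitor only gets the login page.
    return "<h1>Please log in</h1>"


# A logged-in human sends the cookie that was set at login.
human_view = serve_members_page({"session": "valid"})

# A spider that does not accept cookies sends none, like a guest.
spider_view = serve_members_page({})

print(human_view)
print(spider_view)
```

The spider ends up indexing the login page at best, never the private threads.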
3. Do robots execute JavaScript functions?
Functions can be executed either server side or client side. Scripts and functions written in ASP, PHP, or other server-side languages execute on the server when a specific URL is requested, no matter who or what requests it, and the server then presents the resulting HTML document to search engines and users alike. Search engines therefore do not need to execute those functions; they receive the same HTML the end user does. Scripts or functions written in client-side JavaScript or other client-side scripting languages execute in the client (the browser), if that client is set to do so and supports the language. Most normal browsers support JavaScript, but most spiders don’t.
Spiders often read the strings inside a client-side script to see if they can find meaningful words or full URLs, but they generally do not execute the script.
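Here is a small sketch of what "reading the strings without executing the script" could look like. The regular expression and the `urls_in_script` helper are my own simplification, not the method of any particular search engine.

```python
import re

# Matches full http/https URLs; stops at whitespace, quotes, or brackets.
URL_PATTERN = re.compile(r"https?://[^\s'\"<>)]+")


def urls_in_script(script_text):
    """Return full URLs that appear literally in the script source."""
    return URL_PATTERN.findall(script_text)


script = """
function go() {
    window.location = 'http://example.com/products.html';
}
var path = '/about' + '.html';  // built at run time, invisible to the scan
"""

found = urls_in_script(script)
print(found)
```

The full URL is picked up because it sits in the source as plain text. The second address is only assembled when the script runs, so a spider that does not execute JavaScript never sees it.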
4. Could I be doing something technically that is stopping a robot from indexing my site?
Yes. You can stop search engine spiders from indexing your site or parts of your site. Many webmasters suggest that you SHOULD, because there are some files and pages that you don’t want the spiders to index. In a WordPress blog, for example, you should block everything related to wp- (wp-content, wp-admin, etc.). This can be done with your robots.txt or .htaccess file.
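As an illustration, the sketch below feeds a small robots.txt to Python's standard `urllib.robotparser` module and asks which URLs a well-behaved robot may fetch. The rules, the `example.com` URLs, and the bot name `ExampleBot` are assumptions made for the example; adjust the paths to your own site.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption, not a recommended default).
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleBot"):
    """True if the rules above let this (hypothetical) bot fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://example.com/wp-admin/options.php"))
print(allowed("http://example.com/2007/09/my-post.html"))
```

Keep in mind that robots.txt is a request that polite robots honor. It is not access control, so anything truly private still needs a login or server-side restrictions such as .htaccess rules.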
5. How do robots interpret my page?
Every bot interprets your page differently, but there are some rules that they follow. Most of them read your page as a human does: they start with the header, then the body, then the footer. They then follow the links, taking note of the anchor text, and they give priority to “h” tags (h1, h2, h3…) and to bold, italic, and similar tags.
6. In what order do robots index my page? What is the very first step that a robot takes?
As mentioned before: header, body, footer. The first step is to get the title of your page, and then the robot continues through the rest.
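To give a feel for this, here is a simplified sketch that walks a page from top to bottom with Python's standard `html.parser` and collects the title, the headings, and the links with their anchor text. The `PageOutline` class and the sample HTML are invented for this example, and real robots certainly do much more than this.

```python
from html.parser import HTMLParser


class PageOutline(HTMLParser):
    """Collects title, headings, and links in document order.

    Simplification: nested tags of interest (a link inside a heading,
    for example) are not handled.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []   # list of (tag, text)
        self.links = []      # list of (href, anchor text)
        self._open = None    # tag currently being captured
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" or tag == "a" or tag in self.HEADINGS:
            self._open = tag
            self._text = []
            if tag == "a":
                self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._open:
            return
        text = "".join(self._text).strip()
        if tag == "title":
            self.title = text
        elif tag == "a":
            self.links.append((self._href, text))
        else:
            self.headings.append((tag, text))
        self._open = None


PAGE = """<html><head><title>My Blog</title></head>
<body><h1>Welcome</h1><p>Some <b>bold</b> text.</p>
<a href="/about.html">About me</a>
<h2>Latest posts</h2></body></html>"""

outline = PageOutline()
outline.feed(PAGE)
print(outline.title, outline.headings, outline.links)
```

Because the parser reads the markup in order, the title in the header is found first, and the body's headings and links follow.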