A Web Crawler is a computer program that automatically browses the World Wide Web in a methodical way. Web Crawlers are also called ants, bots, worms, or Web spiders. The process of scanning the WWW is called Web crawling or spidering.
What do Web Crawlers do?
Web Crawling is used by Search Engines to provide up-to-date data to users. What Web Crawlers essentially do is create a copy of every visited page for later processing by a Search Engine. The search engine then indexes the downloaded pages in order to provide fast searches.
Web Crawlers are also used for automating tasks on websites such as checking links or validating HTML code.
A Web crawler usually starts with a list of URLs to visit (called the seeds). As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit (crawl frontier). URLs from the frontier are then recursively visited according to a set of policies.
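The seed-and-frontier loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the `fetch_links(url)` helper passed in is a hypothetical function that would download a page and return the hyperlinks it contains.

```python
# A minimal sketch of the crawl loop: start from seed URLs, visit each one,
# and add newly discovered links to the frontier. fetch_links is assumed to
# return the list of hyperlinks found on a page.
from collections import deque

def crawl(seeds, fetch_links, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be visited (the "crawl frontier")
    visited = set()           # URLs already crawled
    pages = {}                # url -> outgoing links (our "copy" of each page)
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        links = fetch_links(url)   # download the page and extract hyperlinks
        pages[url] = links
        for link in links:         # newly discovered URLs join the frontier
            if link not in visited:
                frontier.append(link)
    return pages
```

A real crawler would add the "set of policies" mentioned above on top of this loop: politeness delays between requests to the same host, respect for robots.txt, and revisit scheduling.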
Here is a picture I made to show you the architecture of a Web Crawler:
Not all Web Crawlers are meant to help users, though! Since crawlers can be used to gather specific information from web pages, they are often used to harvest e-mail addresses for spam.
Googlebot is an example of a crawler: it’s the Web Crawler used by Google to collect documents from the web and build a searchable index for the Google search engine. Googlebot uses algorithmic software to determine which sites to crawl, how often, and how many pages to fetch from each site. To achieve this, Google uses a huge set of servers to “crawl” billions of website pages all over the web.
As a Web Crawler Googlebot begins with a list of webpage URLs (generated from previous crawl processes) and also uses information from the Sitemap files provided by webmasters. Googlebot then detects links on each visited page and adds them to its list of pages to crawl. During this process, new sites, changes to existing sites, and dead links are noted and used to update the Google index.
Googlebot maintains a massive index of all the words it sees and their location on each page. It can also process information in content tags and attributes (e.g. title tags and ALT attributes). Googlebot cannot process all content types, though (e.g. it cannot process the content of some rich media files or dynamic pages).
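A “words and their location on each page” index is commonly implemented as an inverted index. Here is a toy sketch of the idea, assuming plain-text page content; it is an illustration of the data structure, not Google’s actual implementation.

```python
# Toy inverted index: maps each word to the (page, position) pairs
# where it occurs, so queries can look up matching pages quickly.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping a page URL to its text content."""
    index = defaultdict(list)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((url, position))  # word seen at this page/offset
    return index

index = build_index({
    "page1": "web crawlers browse the web",
    "page2": "search engines index the web",
})
print(index["web"])  # → [('page1', 0), ('page1', 4), ('page2', 4)]
```

Answering a query then becomes a fast dictionary lookup instead of scanning every page.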
When a user enters a query, the Search Engine searches the index for matching pages and returns the most relevant results, determined by over 200 factors, such as the PageRank of a given page. PageRank is a measure of the importance of a page, based on the incoming links it receives from other pages.
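The core PageRank idea — a page is important if important pages link to it — can be sketched with basic power iteration. This is a simplified toy version (for example, it ignores pages with no outgoing links), and real PageRank is only one of the many ranking factors mentioned above.

```python
# Simplified PageRank via power iteration: each page repeatedly passes a
# damped share of its score along its outgoing links.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                continue                          # toy version: skip dead ends
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:               # each link passes on a share
                new_rank[target] += share
        rank = new_rank
    return rank
```

On a small link graph such as `{"a": ["b"], "b": ["a", "c"], "c": ["a"]}`, the page with the most incoming links ("a") ends up with the highest score.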
I summarized these steps in the following picture:
How to block Googlebot
If you want to block Googlebot (or other crawlers) from accessing and indexing the information on your website, you can add the appropriate directives to your robots.txt file, or simply add a meta tag to your webpages. Note that requests from Googlebot include the user-agent string “Googlebot” and come from hosts in the “googlebot.com” domain.
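For example, a robots.txt entry like this asks Googlebot not to crawl any page on the site:

```
User-agent: Googlebot
Disallow: /
```

And this meta tag, placed in a page’s head section, asks crawlers not to index that particular page:

```html
<meta name="robots" content="noindex">
```

Keep in mind that robots.txt is a request honored by well-behaved crawlers like Googlebot, not an access control mechanism.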
Fetch as Googlebot
If you want to see what Googlebot sees when it accesses your page, Google has provided a “Fetch as Googlebot” feature that lets users submit pages and get real-time feedback on what Googlebot sees. “Fetch as Googlebot” comes in useful if you re-implement your site, find out that some of your web pages have been hacked, or want to understand why you’re not ranking for specific keywords.
If you want to use “Fetch as Googlebot”, all you have to do is:
1. Login to Webmaster Tools
2. Select your site
3. Go to Labs –> Fetch as Googlebot
Search engines key processes
In conclusion, the three key processes that Search engines need in order to deliver search results to users are:
- Crawling: Does Google know about the existence of your website? Can Google find it?
- Indexing: Can Google index your website?
- Serving: Does your website have good-quality, useful content that is relevant to potential users’ searches?
I hope you found this useful!