What is a web spider and how does it work?
Information, Mon, 06 Mar 2017, 08:33am GMT
Web Spider - A Web spider is a program or automated script that browses the World Wide Web in a methodical, automated manner. Spiders are used to index pages for web search engines. It is called a spider because it crawls across the Web; another common term for these programs is web crawler.
How does it work?
The usual starting points are lists of heavily used servers and very popular pages. The spider begins with a popular site, indexing the words on its pages and following every link found within the site. In this way, the crawl quickly spreads out across the most widely used portions of the Web. When the spider can no longer find a page, that page is eventually deleted from the index, although some spiders will check a second time to verify that the page really is offline.
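The crawl loop described above (start from seed pages, index each page, follow its links, and drop pages that can no longer be fetched) can be sketched as a breadth-first traversal. This is a simplified illustration rather than any search engine's actual implementation; the `fetch` callable is a hypothetical stand-in for a real HTTP download.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl: visit the seed, index its words (here, the raw
    HTML), extract links, and enqueue any URL not yet seen.
    `fetch` is any callable returning the HTML for a URL, or None if the
    page is gone (a stand-in for a real HTTP client)."""
    frontier = deque([seed])
    seen = {seed}
    index = {}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:            # page not found: drop it from the index
            index.pop(url, None)
            continue
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return index
```

The queue (the "frontier") is what makes the crawl spread outward in widening rings from the seed, which matches the behavior described above.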
The first thing a spider is supposed to do when it visits your website is look for a file called "robots.txt". This file contains instructions telling the spider which parts of the site it may index and which parts to ignore. A robots.txt file is the standard way to control what a spider crawls on your site, although compliance is voluntary: spiders are supposed to follow its rules, and the major search engines do follow them for the most part. Fortunately, major search engines such as Google and Bing have been working together on a common standard.
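Python's standard library ships a robots.txt parser, `urllib.robotparser`, that a spider can consult before fetching a URL. The sketch below parses an inline rule set rather than downloading a live file; the bot name and paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# In a real spider you would point at the live file instead:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# A polite spider checks every URL against the rules before fetching it.
print(rp.can_fetch("MyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page.html"))   # True
```

Note that `can_fetch` only reports what the rules say; nothing technically stops a badly behaved spider from ignoring them, which is why robots.txt is a convention rather than an enforcement mechanism.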
Google began as an academic search engine. The paper that describes how the system was built also describes how quickly its spiders could work. The initial system used multiple spiders, usually three at a time, and each spider could keep about 300 connections to Web pages open at once. At peak performance, using four spiders, the system could crawl over 100 pages per second, generating around 600 kilobytes of data each second.
Keeping everything running quickly meant building a system to feed the necessary information to the spiders. The early Google system had a server dedicated to providing URLs to the spiders. Rather than depending on an Internet service provider's domain name server to translate a server's name into an address, Google ran its own DNS in order to keep delays to a minimum.
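The DNS idea can be illustrated with a small memoizing resolver: do each name-to-address lookup once, then answer repeats from memory. This is only a sketch of the principle, not Google's system; the `resolve` callable is a hypothetical stand-in for a real DNS query.

```python
def make_cached_resolver(resolve):
    """Wrap a resolver callable (hostname -> IP address string) with an
    in-process cache, so repeated lookups for the same host skip the
    slow network round trip. `resolve` is a stand-in for a real DNS
    query (an assumption for illustration)."""
    cache = {}
    def cached(hostname):
        if hostname not in cache:
            cache[hostname] = resolve(hostname)
        return cache[hostname]
    return cached

# Demonstration with a fake resolver that counts how often it is called.
calls = []
def fake_resolve(hostname):
    calls.append(hostname)
    return "93.184.216.34"   # made-up answer for the demo

lookup = make_cached_resolver(fake_resolve)
lookup("example.com")
lookup("example.com")        # answered from the cache; no second query
```

Since a crawler fetches many pages from the same hosts, caching resolutions this way removes one network delay from almost every request.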
When the Google spider looked at an HTML page, it took note of two things: the words within the page, and where on the page those words were found.
Words occurring in the title, subtitles, meta tags, and other positions of relative importance were flagged for special consideration during subsequent user searches. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an" and "the." Other spiders take different approaches.
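Recording each significant word together with where it appears is essentially building an inverted index. The sketch below is a minimal illustration assuming a plain-text page, with the three articles as the only stopwords; real engines tokenize far more carefully and also weight words by position, as described above.

```python
import re

STOPWORDS = {"a", "an", "the"}   # the articles the Google spider skipped

def index_page(url, text, index):
    """Add one page to an inverted index mapping each significant word
    to a list of (url, position) entries. Positions count all words on
    the page, so later ranking code can tell where a word appeared."""
    for position, word in enumerate(re.findall(r"[a-z']+", text.lower())):
        if word in STOPWORDS:
            continue
        index.setdefault(word, []).append((url, position))
    return index

index = {}
index_page("http://example.com/", "The spider crawls the web", index)
# "the" is skipped; "spider", "crawls" and "web" keep their positions
```

A query engine can then look up a word in `index` and immediately know every page it occurs on, which is what makes searching fast at query time.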
Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code.
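A link checker of the kind mentioned above can be sketched in a few lines. Here the HTTP status lookup is injected as a callable, a hypothetical stand-in for issuing a real HEAD request, so the core logic stands out.

```python
def find_broken_links(links, get_status):
    """Return the links whose HTTP status signals an error (4xx or 5xx).
    `get_status` maps a URL to a status code; a real tool would issue an
    HTTP HEAD request here (this stand-in is an assumption)."""
    return [url for url in links if get_status(url) >= 400]

# Demonstration with canned statuses instead of live requests.
statuses = {"/ok": 200, "/moved": 301, "/gone": 404, "/error": 500}
broken = find_broken_links(statuses, statuses.get)
# broken == ["/gone", "/error"]
```

Pairing this with the crawl loop shown earlier yields a simple site-maintenance spider: crawl the site to collect links, then report the ones that no longer resolve.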
Crawlers can likewise be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).