Most search engines present nothing more than a single input field and a button, followed by a list of links. Behind this unassuming interface, however, a whole machine works to find the results relevant to the entered term.
The mode of operation of every modern search engine can be described schematically. It consists of three phases:
– Collecting information (Crawling)
– Indexing the collected data
– Search and ranking results
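The three phases can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the `PAGES` dictionary stands in for the crawled web, the function names are hypothetical, and the ranking (summed term occurrences) is a deliberately naive placeholder, not how any real search engine scores results.

```python
from collections import Counter, defaultdict

# Hypothetical in-memory "web": URL -> page text. It replaces real HTTP fetching.
PAGES = {
    "http://example.com/a": "web spiders crawl the web",
    "http://example.com/b": "an index maps terms to pages",
    "http://example.com/c": "spiders feed the index",
}

def crawl(pages):
    """Phase 1: collect (url, text) pairs. A real spider would download them."""
    return list(pages.items())

def build_index(documents):
    """Phase 2: build an inverted index, term -> {url: number of occurrences}."""
    index = defaultdict(Counter)
    for url, text in documents:
        for term in text.lower().split():
            index[term][url] += 1
    return index

def search(index, query):
    """Phase 3: score each URL by summed term occurrences, best first."""
    scores = Counter()
    for term in query.lower().split():
        scores.update(index.get(term, {}))
    # Sort by descending score, then by URL so that ties are deterministic.
    return [url for url, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]

index = build_index(crawl(PAGES))
results = search(index, "web spiders")
```

For the query "web spiders", page `a` contains both terms and is ranked ahead of page `c`, which contains only one of them; page `b` contains neither and is not returned.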
In order for a search engine to return a document as a result of a query, it must first find it. To find information on billions of web pages, search engines use spiders. A web spider (also called a web crawler, web robot, or web bot) is a program or script that browses the web automatically and collects information about the pages it visits. This process of collecting information is called web crawling or spidering.
Two very important characteristics of the Web dictate the behavior of a spider and make its task very difficult:
– A large number of pages. As a result, a spider can visit only a fraction of the web, which means that this fraction must be carefully selected.
– The speed of change. By the time the spider visits the last page of a site, it is very likely that some pages have been added, some have been deleted, and some have been modified. This is especially characteristic of large sites.
The architecture of the spider
In order for a spider to be efficient, it must have an extremely optimized architecture. It is very easy to build a spider that downloads a few pages per second and runs only for a short time. Building an efficient and robust spider that downloads thousands of millions of pages in a few weeks, however, is a big challenge.
Two basic elements of the spider
Shkapenyuk and Suel presented an example of the architecture of a spider in their work. According to them, each spider consists of two main components:
– Crawling application
– Crawling system
The crawling application decides which URL the crawling system should visit next. It parses each downloaded page in search of links, checks whether each URL has already been visited and, if it has not, forwards it to the crawling system. Which link is visited next is determined by one of the many selection strategies and re-visit strategies.
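The behavior described above can be sketched as follows. This is a minimal illustration, not the design from Shkapenyuk and Suel: the class and method names are hypothetical, the selection strategy is assumed to be plain first-in, first-out (breadth-first), and no re-visit strategy is modeled.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop "#fragment" parts.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)

class CrawlingApplication:
    """Hypothetical crawling application with a FIFO selection strategy."""

    def __init__(self, seeds):
        self.frontier = deque(seeds)  # URLs waiting to be handed to the crawling system
        self.seen = set(seeds)        # URLs already queued or visited

    def next_url(self):
        """Decide which URL the crawling system should visit next."""
        return self.frontier.popleft() if self.frontier else None

    def process_page(self, url, html):
        """Parse a downloaded page and queue every link not seen before."""
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in self.seen:
                self.seen.add(link)
                self.frontier.append(link)

app = CrawlingApplication(["http://example.com/"])
url = app.next_url()
page = '<a href="/a">A</a> <a href="/a#top">A again</a> <a href="http://example.com/">home</a>'
app.process_page(url, page)
queued = list(app.frontier)
```

Of the three links on the sample page, only `http://example.com/a` is queued: the second link points to the same page once its fragment is removed, and the third is the seed URL that has already been visited. Replacing the `deque` with a priority queue would be one way to plug in a different selection strategy.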
The architecture of the crawling system
The crawling system has the task of downloading the requested page and forwarding it to the crawling application for analysis and storage. It consists of several specialized components.
The crawl manager is responsible for receiving URLs from the crawling application and forwarding each one to an available downloader, while respecting the rules found in the robots.txt file.
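A crawl manager of this kind might look roughly like the sketch below. It uses `urllib.robotparser` from the Python standard library to evaluate robots.txt rules. Everything else is an assumption made for illustration: the `CrawlManager` class, the `ExampleSpider` user agent, the representation of downloaders as plain names, the robots.txt text supplied in memory instead of fetched, and the choice to allow hosts for which no rules are known.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical user agent name

class CrawlManager:
    """Hypothetical crawl manager: checks robots.txt, then picks an idle downloader."""

    def __init__(self, downloaders, robots_by_host):
        self.idle = list(downloaders)
        self.rules = {}
        # Parse robots.txt text that is assumed to have been fetched already.
        for host, text in robots_by_host.items():
            parser = RobotFileParser()
            parser.parse(text.splitlines())
            self.rules[host] = parser

    def allowed(self, url):
        """Return True if the rules for the URL's host permit fetching it."""
        parser = self.rules.get(urlsplit(url).netloc)
        # Assumption: hosts without known rules are treated as allowed.
        return parser is None or parser.can_fetch(USER_AGENT, url)

    def dispatch(self, url):
        """Return the downloader assigned to url, or None if disallowed or none is idle."""
        if not self.allowed(url) or not self.idle:
            return None
        return self.idle.pop(0)

    def release(self, downloader):
        """Mark a downloader as idle again once it has finished."""
        self.idle.append(downloader)

manager = CrawlManager(
    ["downloader-1"],
    {"example.com": "User-agent: *\nDisallow: /private/\n"},
)
blocked = manager.dispatch("http://example.com/private/report.html")
assigned = manager.dispatch("http://example.com/index.html")
busy = manager.dispatch("http://example.com/other.html")
```

Here the first URL is rejected by the `Disallow: /private/` rule, the second is assigned to the only downloader, and the third receives no downloader until `release` is called.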
Web spiders are the central part of every search engine, and because of that the architecture of each commercial spider is a strictly guarded business secret.