This is a survey of the science and practice of web crawling. While at first glance web crawling may appear to be merely an application of breadth-first search, in reality it poses many challenges, ranging from systems concerns such as managing very large data structures to theoretical questions such as how often to revisit evolving content sources. This survey outlines the fundamental challenges, describes state-of-the-art models and solutions, and highlights avenues for future work.
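The "merely breadth-first search" baseline the abstract alludes to can be sketched as a FIFO frontier paired with a duplicate-URL eliminator. The sketch below is illustrative only: the in-memory `WEB_GRAPH` dictionary is a stand-in of my own for real HTTP fetching and link extraction, which a production crawler would perform instead.

```python
from collections import deque

# Toy in-memory "web graph": URL -> outlinks. A hypothetical stand-in
# for fetching a page over HTTP and extracting its hyperlinks.
WEB_GRAPH = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

def bfs_crawl(seeds):
    """Crawl in breadth-first order: a FIFO queue holds the frontier,
    and the `seen` set plays the role of a duplicate URL eliminator."""
    frontier = deque(seeds)
    seen = set(seeds)
    order = []                     # pages in the order they were crawled
    while frontier:
        url = frontier.popleft()   # FIFO: oldest discovered URL first
        order.append(url)
        for link in WEB_GRAPH.get(url, []):
            if link not in seen:   # skip already-discovered URLs
                seen.add(link)
                frontier.append(link)
    return order
```

Swapping the FIFO queue for a priority queue keyed on, say, estimated PageRank turns this same skeleton into one of the prioritized crawl ordering policies the survey discusses.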
Crawl Ordering Problem
Incremental Crawl Ordering