TY - JOUR
T1 - An investigation of web crawler behavior
T2 - Characterization and metrics
AU - Dikaiakos, Marios D.
AU - Stassopoulou, Athena
AU - Papageorgiou, Loizos
PY - 2005/5/16
Y1 - 2005/5/16
N2 - In this paper, we present a characterization study of search-engine crawlers. For the purposes of our work, we use Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of different crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of the general World-Wide Web traffic and to general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior, vis-à-vis a crawler's preference for resources of a particular format, its frequency of visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers.
AB - In this paper, we present a characterization study of search-engine crawlers. For the purposes of our work, we use Web-server access logs from five academic sites in three different countries. Based on these logs, we analyze the activity of different crawlers that belong to five search engines: Google, AltaVista, Inktomi, FastSearch and CiteSeer. We compare crawler behavior to the characteristics of the general World-Wide Web traffic and to general characterization studies. We analyze crawler requests to derive insights into the behavior and strategy of crawlers. We propose a set of simple metrics that describe qualitative characteristics of crawler behavior, vis-à-vis a crawler's preference for resources of a particular format, its frequency of visits to a Web site, and the pervasiveness of its visits to a particular site. To the best of our knowledge, this is the first extensive and in-depth characterization of search-engine crawlers. Our results and observations provide useful insights into crawler behavior and serve as a basis for our ongoing work on the automatic detection of Web crawlers.
KW - Crawlers
KW - Web characterization
UR - http://www.scopus.com/inward/record.url?scp=17644390582&partnerID=8YFLogxK
U2 - 10.1016/j.comcom.2005.01.003
DO - 10.1016/j.comcom.2005.01.003
M3 - Article
AN - SCOPUS:17644390582
SN - 0140-3664
VL - 28
SP - 880
EP - 897
JO - Computer Communications
JF - Computer Communications
IS - 8
ER -