A probabilistic reasoning approach for discovering web crawler sessions

Athena Stassopoulou, Marios D. Dikaiakos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper we introduce a probabilistic-reasoning approach to distinguish Web robots (crawlers) from human visitors of Web sites. Our approach employs a Naive Bayes network to classify the HTTP sessions of a Web-server access log as crawler-induced or human-induced. The Bayesian network combines various pieces of evidence that were shown to distinguish between crawler and human HTTP traffic. The parameters of the Bayesian network are determined with machine learning techniques, and each session is assigned to the class with the maximum posterior probability given the available evidence. Our method is applied to real Web logs and achieves a classification accuracy of 95%. The high accuracy with which our system detects crawler sessions demonstrates the robustness and effectiveness of the proposed methodology.
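The classification step described in the abstract can be illustrated with a minimal sketch of a discrete Naive Bayes classifier that picks the class with the maximum posterior probability. This is not the authors' implementation: the abstract does not list the evidence variables, so the feature names (`requested_robots_txt`, `image_ratio`), the Laplace smoothing constant `alpha`, and the six training sessions are all hypothetical and invented purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(sessions, labels, alpha=1.0):
    """Count class and per-feature value frequencies from labeled sessions."""
    class_counts = Counter(labels)
    # value_counts[cls][feature][value] -> number of training sessions
    value_counts = defaultdict(lambda: defaultdict(Counter))
    feature_values = defaultdict(set)
    for session, cls in zip(sessions, labels):
        for feature, value in session.items():
            value_counts[cls][feature][value] += 1
            feature_values[feature].add(value)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "feature_values": feature_values,
        "alpha": alpha,  # Laplace smoothing constant (an assumption)
    }


def posterior(model, session):
    """Return P(class | evidence) for every class, computed in log space."""
    alpha = model["alpha"]
    total = sum(model["class_counts"].values())
    log_scores = {}
    for cls, n_cls in model["class_counts"].items():
        score = math.log(n_cls / total)  # log prior
        for feature, value in session.items():
            count = model["value_counts"][cls][feature][value]
            n_values = len(model["feature_values"][feature])
            # Smoothed conditional probability P(value | cls)
            score += math.log((count + alpha) / (n_cls + alpha * n_values))
        log_scores[cls] = score
    # Normalize so the posteriors sum to 1
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


def classify(model, session):
    """Assign the session to the class with the maximum posterior."""
    post = posterior(model, session)
    return max(post, key=post.get)


# Invented toy sessions with hypothetical features; not data from the paper.
train_sessions = [
    {"requested_robots_txt": True, "image_ratio": "low"},
    {"requested_robots_txt": True, "image_ratio": "low"},
    {"requested_robots_txt": False, "image_ratio": "low"},
    {"requested_robots_txt": False, "image_ratio": "high"},
    {"requested_robots_txt": False, "image_ratio": "high"},
    {"requested_robots_txt": False, "image_ratio": "low"},
]
train_labels = ["crawler", "crawler", "crawler", "human", "human", "human"]

model = train_naive_bayes(train_sessions, train_labels)
suspect = {"requested_robots_txt": True, "image_ratio": "low"}
print(classify(model, suspect), posterior(model, suspect))
```

Working in log space avoids numerical underflow when many pieces of evidence are combined, and the smoothing keeps a feature value that never occurred with a class in training from forcing that class's posterior to zero.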

Original language: English
Title of host publication: Advances in Data and Web Management - Joint 9th Asia-Pacific Web Conference, APWeb 2007 and 8th International Conference on Web-Age Information Management, WAIM 2007, Proceedings
Pages: 265-272
Number of pages: 8
Volume: 4505 LNCS
Publication status: Published - 2007
Event: Joint 9th Asia-Pacific Web Conference on Advances in Data and Web Management, APWeb 2007 and 8th International Conference on Web-Age Information Management, WAIM 2007 - Huang Shan, China
Duration: 16 Jun 2007 → 18 Jun 2007

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4505 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: Joint 9th Asia-Pacific Web Conference on Advances in Data and Web Management, APWeb 2007 and 8th International Conference on Web-Age Information Management, WAIM 2007
Country/Territory: China
City: Huang Shan
Period: 16/06/07 → 18/06/07
