Logfile Preprocessing using WUMprep
A Perl script suite that does more than just filter raw data
about WUMprep (1)
• WUMprep is part of the open-source project HypKnowSys and was written by Carsten Pohle
• it covers logfile preprocessing in two ways:
• filtering
• adding meaning to web sites (taxonomies)
• it can be used both stand-alone and in conjunction with other mining tools (e.g. WUM)
Configuring WUMprep (1)
• wumprep.conf defines the basic settings for each script
• for the moment it is enough to specify your domain and your input log
• before running removeRobots.pl you can define the seconds threshold applied to the timestamps; question: which value is appropriate?
Next step: logfileTemplate (config 2)
• the four basic web server log formats are defined in WUMprep's logFormat.txt
• you arrange logfileTemplate according to the format of your log
• basically anything goes, but if the log is queried from a MySQL database, remember that host, timestamp and agent are mandatory (and the referrer is at least helpful)¹
¹ see Nicolas Michael's presentation for details concerning problems with the basic algorithm
Usage of logfileTemplate (config 3)
If you have this format:
koerting.hannover.kkf.net - - [01/Mar/2003:00:34:41 -0700] "GET /css/styles.css HTTP/1.1" 200 7867 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
you take this template:
@host_dns@ @auth_user@ @ident@ [@ts_day@/@ts_month@/@ts_year@:@ts_hour@:@ts_minutes@:@ts_seconds@ @tz@] "@method@ @path@ @protocol@" @status@ @sc_bytes@ @referrer@ @agent@
If you have this format:
200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
you take this template:
@host_ip@ @agent@ @ts_year@-@ts_month@-@ts_day@ @ts_hour@:@ts_minutes@:@ts_seconds@ @dummy@
NB: Have a close look at your logfile and arrange logfileTemplate by following the given format exactly.
Dealing with a Non-Standard Format (config 4)
• the last slide's second example is taken from a MySQL database:
200.11.240.17 Mozilla/4.0 (compatible; MSIE 5.5; Windows 98) 2001-10-20 00:00:00 2399027
• you have an unusual timestamp format (see the adapted template on the last slide) and a missing referrer; sessionize.pl may, for example, look for foreign referrers to start a new session
Configuring wumprep.conf (5)
• go to [sessionizeSettings] in wumprep.conf
• comment out everything that deals with referrers; it will look like this:
# Set to true if the sessionizer should insert dummy hits to the
# referring document at the beginning of each session.
#sessionizeInsertReferrerHits = true
# Name of the GET query parameter denoting the referrer (leave blank
# if not applicable)
#sessionizeQueryReferrerName = referrer
# Should a foreign referrer start a new session?
#sessionizeForeignReferrerStartsSession = 0
We're ready to go: sessionize the log
• if no cookie ID is given, sessionize.pl uses the host and the timestamp. There is a threshold q = 1800 sec. in wumprep.conf. Let t0 be the timestamp of the first entry of a host; then every request from that host whose timestamp t satisfies t - t0 ≤ q is counted as a subsequent request of the same session
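A minimal Perl sketch of this timeout heuristic (not sessionize.pl itself; the tab-separated input format and the variable names are assumptions for illustration):
#!/usr/bin/perl
use strict;
use warnings;

my $q = 1800;              # session threshold in seconds, as set in wumprep.conf
my (%session_of, %t0_of);  # per host: current session id and its start time t0
my $next_id = 1;

while (my $line = <STDIN>) {
    chomp $line;
    # assumed input format: "host<TAB>epoch_timestamp<TAB>rest of the log line"
    my ($host, $t, $rest) = split /\t/, $line, 3;
    if (!exists $t0_of{$host} || $t - $t0_of{$host} > $q) {
        $session_of{$host} = $next_id++;   # start a new session for this host
        $t0_of{$host}      = $t;           # ...and remember its first timestamp t0
    }
    print "$session_of{$host}|$line\n";    # prefix each request with its session id
}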
detectRobots
• there are two types of robots: ethical and non-ethical (let's say three: the good, the bad and the very ugly ;-)
• the first type acts according to the Robots Exclusion Standard and first looks into a file called robots.txt to learn where it may and may not go
• removing them is done via the robot database indexers.lst; additionally, detectRobots.pl flags IPs as robots when they have accessed robots.txt
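A short Perl sketch of the robots.txt check (illustrative only; the real detectRobots.pl additionally matches hosts and agents against indexers.lst):
use strict;
use warnings;

my %is_robot;   # hosts flagged as robots

while (my $line = <STDIN>) {
    my ($host) = split /\s+/, $line;                       # the host is the first log field
    $is_robot{$host} = 1 if $line =~ m{GET /robots\.txt};  # anyone fetching robots.txt gets flagged
}

print "$_\n" for sort keys %is_robot;                      # list all flagged hosts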
detectRobots (2)
• the second type, whose IP and agent look as if they come from a human, is difficult to detect and requires a sessionized log
• there is (besides two others) a time-based heuristic to remove them: too many HTML requests in a given time span are very likely to come from a robot; the default value in wumprep.conf is 2 sec.
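A rough Perl sketch of this time-based heuristic (the real detectRobots.pl combines several heuristics; the function and its input are assumptions for illustration):
use strict;
use warnings;

my $min_interval = 2;   # minimum seconds between two HTML requests, default in wumprep.conf

# takes a reference to the epoch timestamps of one session's HTML requests and
# returns 1 if two of them lie closer together than $min_interval
sub looks_like_robot {
    my ($timestamps) = @_;
    my @ts = sort { $a <=> $b } @$timestamps;
    for my $i (1 .. $#ts) {
        return 1 if $ts[$i] - $ts[$i - 1] < $min_interval;
    }
    return 0;
}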
detectRobots (3)
• you can add entries to indexers.lst by taking a larger log and typing on the command line:
grep "bot" logfile | awk '{print "robot-host: ", $1}' | sort | uniq >> indexers.lst
• in my logs, detectRobots.pl removed 6% of the robot entries before and 17% after running the script (2668:2821 KB vs. 2360:2821 KB for xyz.nobots : xyz.sess)
• there will always remain some uncertainty about robot detection; further research is necessary
Further Data Cleaning
• thankfully, further cleaning is much easier: logFilter.pl uses the filter rules in wumprep.conf
• you can define your own filter rules or add them to wumprep.conf, e.g.:
\.ico
\.gif
\.jpg
\.jpeg
\.css
\.js
\.GIF
\.JPG
# @mydomainFilter.txt
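What these suffix rules do can be sketched in a few lines of Perl (only an illustration of the idea, not logFilter.pl itself):
use strict;
use warnings;

# suffix patterns to drop, corresponding to the rules above
my @rules = ('\.ico', '\.gif', '\.jpg', '\.jpeg', '\.css', '\.js', '\.GIF', '\.JPG');
my $filter_re = join '|', @rules;

while (my $line = <STDIN>) {
    # keep only log lines whose request does not match any of the filter rules
    print $line unless $line =~ /(?:$filter_re)/;
}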
Taxonomies
• taxonomies are built using regular expressions: map your site according to a taxonomy, and mapReTaxonomies.pl uses your predefined regexes to overwrite the requests in the log with your site concepts
• it will look something like this:
HOME          www\.c-o-k\.de\/$
METHODS       \/cp_\.htm\?fall=3\/
TOOLS         \/cp_\.htm\?fall=1\/
FIELDSTUDIES  \/cp_\.htm\?fall=2\/
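A hedged Perl sketch of what such a mapping amounts to (the data structure and function name are illustrative; mapReTaxonomies.pl reads the concepts from its own configuration):
use strict;
use warnings;

# concept => regex over the requested URL, mirroring the taxonomy above
my @taxonomy = (
    [ HOME         => qr{www\.c-o-k\.de/$}   ],
    [ METHODS      => qr{/cp_\.htm\?fall=3/} ],
    [ TOOLS        => qr{/cp_\.htm\?fall=1/} ],
    [ FIELDSTUDIES => qr{/cp_\.htm\?fall=2/} ],
);

sub map_to_concept {
    my ($url) = @_;
    for my $entry (@taxonomy) {
        my ($concept, $re) = @$entry;
        return $concept if $url =~ $re;   # replace the URL by its site concept
    }
    return $url;                          # leave unmapped requests unchanged
}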
Taxonomies II
• this is what mapReTaxonomies.pl does with it (aggregation):
117858:1|80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 1406 "-" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
117858:1|80.136.155.126 - - [29/Mar/2003:00:02:00 +0100] "GET AUTHOR-FORMAT/LITERATURDB HTTP/1.1" 200 10301 "http://edoc.hu-berlin.de/conferences/conf2/Kuehne-Hartmut-2002-09-08/HTML/kuehne-ch1.html" "Mozilla/5.0 (Windows; U; Win 9x 4.90; de-DE; rv:1.0.1) Gecko/20020823 Netscape/7.0"
• this data aggregation is a necessary step before working with WUM
Taxonomies III
• beyond that, Carsten Pohle wants to use taxonomies as a filter for the uninteresting patterns one usually gets out of association rules: any pattern that matches the taxonomy (via mapReTaxonomies.pl) is most likely to be uninteresting
Further Reading
• Berendt, Mobasher, Spiliopoulou, Wiltshire: Measuring the Accuracy of Sessionizers for Web Usage Analysis
• Pang-Ning Tan, Vipin Kumar: Discovery of Web Robot Sessions Based on Their Navigational Patterns, in: Data Mining and Knowledge Discovery 6 (1) (2002), pp. 9-35
• Nicolas Michael: Erkennen von Web-Robotern anhand ihres Navigationsmusters [Detecting web robots by their navigation patterns] (on Berendt, HS Web Mining SS03)
• Gebhard Dettmar: Knowledge Discovery in Databases - Methodik und Anwendungsbereiche [Methodology and Application Areas], Knowledge Discovery in Databases, Part II - Web Mining
Logfile-Preprocessing via WUMprep Thanks for Listening!