Unveiling Lies: A Deep Dive into Data Analysis Strategies
Presentation Transcript
Caltech Theses Collection Usage Analysis • Ed Sponsler, George Porter, Betsy Coles • California Institute of Technology Library System
Three Kinds of Lies • White Lies • Damned Lies • Statistics
Examining the Data’s Details • Study the data: What created it? Human? Computer? What does it mean? • WRONG: How can the data address my questions? • RIGHT: What questions can the data address?
Caltech Theses Facts • First Digital Deposit: July, 2001 • Number of Theses: 1208 • Software Used: VT ETDdb (but not for much longer) • Campus Mandate: June, 2002 • Defense Date Range: 1922 to present
Caltech Theses Statistics • Data Source: Apache Web Logs • What is an access? • What can be ignored and why? • What do human v robot accesses look like? • What is a referrer? User Agent? Host IP? Requested Object?
Apache Combined Log Format 63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] "GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" 200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"
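To make the fields named on the previous slide concrete (host IP, requested object, referrer, user agent), here is a minimal sketch of splitting one combined-format line into named fields. The regular expression and the `parse_line` helper are illustrative assumptions, not the tooling used for this analysis.

```python
import re

# Fields of the Apache combined log format, in the order they appear.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    entry = m.groupdict()
    # The request field is "METHOD PATH PROTOCOL"; keep the path separately.
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else ""
    return entry

# The sample line from the slide, reproduced as-is.
sample = (
    '63.89.199.36 - - [21/Jul/2003:12:53:01 -0700] '
    '"GET /etd/available/etd-12182002-190040/unrestricted/thesis.pdf HTTP/1.1" '
    '200 15767 "http://etd.caltech.edu/etd/available/etd-2182002-190040/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)"'
)
entry = parse_line(sample)
print(entry["host"], entry["status"], entry["path"])
```

Here `host` is the Host IP, `path` the Requested Object, and `referrer` and `agent` the last two quoted fields.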
DeDupe • The dedupe filter counts each host at most once per thesis. • Repeat requests from the same host are ignored, even when they are for a different file of the same thesis, such as a different chapter.
DeDupe The result of the dedupe filter is an access_log containing at most one log entry for each unique host that has accessed any file of a given thesis.
DeDupe Data Structure
Theses ID -> Host IP list
etd-3493 -> 131.212.13.22, 124.24.21.1, 145.46.55.6
etd-1139 -> 131.212.13.22, 133.25.5.12, 154.21.78.9
etd-944 -> 131.215.12.22, 133.42.3.99, 101.24.21.99
access_log
131.212.13.22 - - [21/Jul/2003:12
124.24.21.1 - - [12/Aug/2003:15
145.46.55.6 - - [05/Sep/2003:05
131.212.13.22 - - [20/Sep/2003:04
133.25.5.12 - - [28/Sep/2003:11
154.21.78.9 - - [03/Oct/2003:09
131.215.12.22 - - [05/Jan/2004:02
133.42.3.99 - - [09/Jan/2004:07
101.24.21.99 - - [14/Feb/2004:01
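The dedupe filter and its thesis-to-hosts structure can be sketched as follows. This is a minimal illustration, not the actual DeDupe code: the `dedupe` function, the regular expression, and the assumption that the thesis ID is the path segment after `/etd/available/` (as in the sample log line) are all hypothetical, and the log lines in the usage example are invented.

```python
import re
from collections import defaultdict

# Assumed URL layout, based on the sample log line: /etd/available/<thesis-id>/...
THESIS_RE = re.compile(r'"(?:GET|HEAD) /etd/available/(etd-[^/\s"]+)')

def dedupe(lines):
    """Keep the first log line for each (thesis, host) pair; drop the rest."""
    seen = defaultdict(set)  # thesis ID -> set of host IPs already counted
    kept = []
    for line in lines:
        m = THESIS_RE.search(line)
        if m is None:
            continue  # not a thesis request; ignored in this sketch
        thesis = m.group(1)
        host = line.split(" ", 1)[0]
        if host not in seen[thesis]:
            seen[thesis].add(host)
            kept.append(line)
    return kept, seen

# Invented log lines: one host reads two chapters of one thesis, then another thesis.
log = [
    '131.212.13.22 - - [21/Jul/2003:12:53:01 -0700] "GET /etd/available/etd-3493/unrestricted/ch1.pdf HTTP/1.1" 200 100 "-" "-"',
    '131.212.13.22 - - [21/Jul/2003:12:54:10 -0700] "GET /etd/available/etd-3493/unrestricted/ch2.pdf HTTP/1.1" 200 100 "-" "-"',
    '131.212.13.22 - - [20/Sep/2003:04:00:00 -0700] "GET /etd/available/etd-1139/unrestricted/thesis.pdf HTTP/1.1" 200 100 "-" "-"',
]
kept, seen = dedupe(log)
print(len(kept))
```

The second request is dropped because the same host has already been counted for that thesis; the third is kept because it is for a different thesis.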
User Agents • Known Human Users 71%: Internet Explorer 60%, Netscape 11% • Bots/Harvesters/Other 29%: Googlebot 14%, Other 15%
Country of Origin Report • GeoIP database contains IP blocks and their country of origin • More useful and complete than top-level domain names (.edu, .de, .uk, etc.)
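A lookup against a table of IP blocks might look like the sketch below. The block table is hypothetical: it uses reserved documentation address ranges and placeholder country names, and it does not reflect the format or API of any particular GeoIP product.

```python
import ipaddress
from collections import Counter

# Hypothetical block table; a real GeoIP database holds far more rows.
# These networks are reserved documentation ranges, not real allocations.
BLOCKS = [
    (ipaddress.ip_network("192.0.2.0/24"), "Country A"),
    (ipaddress.ip_network("198.51.100.0/24"), "Country B"),
    (ipaddress.ip_network("203.0.113.0/24"), "Country C"),
]

def country_of(ip, blocks=BLOCKS):
    """Return the country of the first block containing ip, else 'Unknown'."""
    addr = ipaddress.ip_address(ip)
    for network, country in blocks:
        if addr in network:
            return country
    return "Unknown"

def country_report(hosts):
    """Count hosts per country, most frequent first."""
    return Counter(country_of(h) for h in hosts).most_common()

report = country_report(["192.0.2.1", "192.0.2.2", "203.0.113.9"])
print(report)
```

A linear scan is enough for a sketch; a table with many blocks would more likely be sorted by start address and searched with bisection.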
Geographic Analysis • 153 countries represented
United States | 76294
China | 7943
Germany | 4763
United Kingdom | 4646
Canada | 3918
India | 3328
Japan | 3271
France | 2887
Italy | 2066
Taiwan | 2063
Korea | 1639
Spain | 1300
Australia | 1249
Netherlands | 1239
Iran | 1208
Malaysia | 1160
Hong Kong | 1007
Turkey | 961
Brazil | 860
Poland | 853
Singapore | 847
Russian Fed. | 812
Switzerland | 810
Sweden | 759
Israel | 743
Belgium | 735
Mexico | 724
Thailand | 648
Egypt | 542
Greece | 511
Romania | 480
Vietnam | 455
Indonesia | 451
Portugal | 438
Finland | 419
Philippines | 418
Most Popular Theses
Count | Defense Date
3322 | 2000-10-23
3199 | 2002-08-07
3174 | 2002-07-16
2457 | 2001-10-23
2153 | 2002-10-02
2120 | 2002-09-25
2098 | 2001-05-18
2073 | 2002-10-04
1959 | 2002-11-05
1848 | 2003-01-14
1675 | 2002-08-14
1614 | 2002-05-02
1486 | 2002-09-04
1378 | 2003-09-02
1304 | 2001-02-09
1296 | 2003-05-15
1176 | 2003-05-15
1134 | 2001-05-07
1130 | 2002-01-16
1124 | 2001-03-08
1123 | 2003-06-02
1091 | 2001-01-19
1087 | 2003-03-20
Most Popular Theses (>1000 downloads)
Defense Date | Title
2000-10-23 | Blocking Adhesion to Cell and Tissue Surfaces via Steric Stabilization with Graft Copolymers containing Poly(Ethylene Glycol) and Phenylboronic Acid
2002-08-07 | Electrochemical Sensors Based on DNA-Mediated Charge Transport Chemistry
2002-07-16 | Effects of Surface Modification on Charge-Carrier Dynamics at Semiconductor Interfaces
2001-10-23 | I. Seafloor Morphology of the Osbourn Trough and Kermadec Trench and II. Multiscale Dynamics of Subduction Zones
2002-10-02 | I. Structure-Function Analysis of the Mechanosensitive Channel of Large Conductance. II. Design of Novel Magnetic Materials using Crystal Engineering.
Most Popular Theses
Defense Date | Title
2002-09-25 | Modeling a Hox Gene Network: Stochastic Simulation with Experimental Perturbation
2001-05-18 | All-Optical Logic Circuits based on the Polarization Properties of Non-Degenerate Four-Wave Mixing
2002-10-04 | Site-specific incorporation of synthetic amino acids into functioning ion channels
2002-11-05 | Impact-Ionization Mass Spectrometry of Cosmic Dust
2003-01-14 | Force-Detected Nuclear Magnetic Resonance Independent of Field Gradients
2002-08-14 | Fast, High-Order Methods for Scattering by Inhomogeneous Media
2002-05-02 | Neural dynamics underlying complex behavior in a songbird
2002-09-04 | Spectroscopic Characterization of DNA-mediated Charge Transfer
Most Popular Theses
Defense Date | Title
2003-09-02 | Protein Engineering Through in vivo Incorporation of Phenylalanine Analogs
2001-02-09 | Synthesis, Passivation and Charging of Silicon Nanocrystals
2003-05-15 | Sensitizer-linked substrates as probes of heme enzyme structure and catalysis
2003-05-15 | Mirror Thermal Noise in Interferometric Gravitational Wave Detectors
2001-05-07 | Analysis and Design of Turbo-like Codes
2002-01-16 | Computational Enzyme Design
2001-03-08 | An Investigation of Ion Engine Erosion by Low Energy Sputtering
2003-06-02 | Laboratory Evolution of Cytochrome P450 Peroxygenase Activity
2001-01-19 | Passive Hypervelocity Boundary Layer Control Using an Acoustically Absorptive Surface
2003-03-20 | Mapping the cytochrome c folding landscape
Human / Robot Split • Human activity is identified by ‘MSIE’ or ‘Mozilla’ in the User Agent field of the apache_log
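The rule on this slide amounts to a substring test on the user agent. A minimal sketch, in which the function names are hypothetical and the two robot user agent strings are illustrative examples rather than entries taken from the Caltech logs:

```python
# Markers named on the slide; a user agent containing either counts as human.
HUMAN_MARKERS = ("MSIE", "Mozilla")

def is_human(user_agent):
    """Apply the slide's rule: human if 'MSIE' or 'Mozilla' appears."""
    return any(marker in user_agent for marker in HUMAN_MARKERS)

def split_agents(agents):
    """Partition user agent strings into (humans, robots)."""
    humans = [a for a in agents if is_human(a)]
    robots = [a for a in agents if not is_human(a)]
    return humans, robots

agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)",
    "Googlebot/2.1 (+http://www.google.com/bot.html)",  # illustrative
    "Wget/1.9.1",                                       # illustrative
]
humans, robots = split_agents(agents)
print(len(humans), len(robots))
```

Under this rule, any crawler that includes "Mozilla" in its user agent string would be counted as human, so the split is only as reliable as the strings that clients choose to send.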
Referrers by Human Use (MSIE | Mozilla) • etd.caltech.edu 33% • www.google.com 32% • search.yahoo.com 8% • www.google.de 3% • all others <2% (each) • 492 total referrers
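Percentages like these can be produced by tallying the host name of each referrer URL. The sketch below assumes that entries with no referrer (logged as "-") are left out of the total; the slide does not say how they were handled. The `referrer_shares` name and the input list are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def referrer_shares(referrers):
    """Return (host, percent) pairs, most frequent first, skipping empty referrers."""
    hosts = [urlparse(r).hostname for r in referrers]
    hosts = [h for h in hosts if h]  # "-" has no hostname and is dropped
    counts = Counter(hosts)
    total = sum(counts.values())
    return [(host, round(100 * n / total)) for host, n in counts.most_common()]

# Invented referrer values for illustration.
shares = referrer_shares([
    "http://etd.caltech.edu/etd/available/etd-2182002-190040/",
    "http://www.google.com/search?q=thesis",
    "http://etd.caltech.edu/",
    "-",
])
print(shares)
```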
Most Active Robots Since April, 2004
Googlebot | 3524
Googlebot/Test | 1100
TurnitinBot | 362
Wget | 252
msnbot | 162
DA | 41
Contype | 36
ia_archiver | 33
FAST-WebCrawler | 18
NPBot | 16
NetAnts | 16
Summary • Keep Statistics Honest: understand and scrub your data before analysis • Google is key for discovery • Theses are popular because they are new and have useful content
Next Steps • Compare download frequencies, not just totals • Create local IP -> domain name database • Adapt DeDupe to CODA EPrints Archives
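One way to read "frequencies, not just totals" is downloads per day of availability, so that a recently deposited thesis is not penalized for having had less time to accumulate accesses. The sketch below assumes that interpretation; the function name, the choice of availability date as the starting point, and all numbers are hypothetical.

```python
from datetime import date, timedelta

def downloads_per_day(count, available_since, as_of):
    """Average downloads per day between available_since and as_of."""
    days = (as_of - available_since).days
    if days <= 0:
        raise ValueError("as_of must be later than available_since")
    return count / days

# Invented figures: thesis A has the larger total, thesis B the higher rate.
as_of = date(2004, 6, 1)
rate_a = downloads_per_day(3000, as_of - timedelta(days=1000), as_of)
rate_b = downloads_per_day(1500, as_of - timedelta(days=300), as_of)
print(rate_a, rate_b)
```

Ranking by rate rather than by total would place thesis B ahead of thesis A in this example.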
Caltech Library System’s Online Digital Archives • Theses: http://etd.caltech.edu • All Archives: http://coda.caltech.edu