Slide 1:Introduction to YouSeer Madian Khabsa
Slide 2:Outline Overview
YouSeer components
Heritrix
Solr
Demo
Slide 3:Overview YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene.
Java-based, and reported to run successfully on Windows (needs confirmation)
Slide 4:Search Engine: Basic Workflow
Slide 5:Advantages of YouSeer Built on top of scalable components
Tested on 23M documents, while Solr and Heritrix can scale to billions
Very flexible, and easy to extend
Modifying the index and the ingestion module is easy
The crawler supports complicated crawling policies
Slide 6:YouSeer Components Heritrix:
The Internet Archive's crawler
Reported to scale up to 1B documents
Written in Java, and has a web interface
Apache Solr:
Open source enterprise search server based on Lucene
Has a REST-like API
Supports caching, distributed search, and index replication
Slide 7:YouSeer Architecture
Slide 8:Heritrix Workflow 1) Choose a URI from among all those scheduled
2) Fetch that URI
3) Analyze or archive the results
4) Select discovered URIs of interest, and add them to those scheduled
5) Note that the URI is done, and repeat
Source: Gordon Mohr et al., An Introduction to Heritrix: An open source archival quality web crawler.
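The five steps above can be sketched as a small loop. This is only an illustration of the workflow, not Heritrix code: the function names (`crawl`, `fetch`, `extract_links`, `in_scope`) and the in-memory `PAGES` table are assumptions made so the example runs without network access.

```python
from collections import deque

def crawl(seeds, fetch, extract_links, in_scope, max_uris=100):
    """Process URIs following the five-step loop described on the slide."""
    frontier = deque(seeds)   # URIs that are scheduled
    seen = set(seeds)         # URIs already scheduled or finished
    archived = {}             # uri -> fetched content
    while frontier and len(archived) < max_uris:
        uri = frontier.popleft()              # 1) choose a scheduled URI
        content = fetch(uri)                  # 2) fetch it
        archived[uri] = content               # 3) archive the result
        for link in extract_links(content):   # 4) schedule discovered URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
        # 5) this URI is done; the loop repeats with the next one
    return archived

# Hypothetical toy "web": each page is represented by its list of outlinks.
PAGES = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": [],
}

result = crawl(
    ["http://a.example/"],
    fetch=lambda uri: PAGES.get(uri, []),
    extract_links=lambda content: content,
    in_scope=lambda uri: uri.startswith("http://a.example/"),
)
print(sorted(result))
```

The `in_scope` callback stands in for the crawling policy: here it keeps the crawl on one host, so `http://b.example/` is discovered but never scheduled.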
Slide 9:Heritrix Architecture
Slide 10:Heritrix Crawl Result By default, Heritrix writes everything it crawls to disk as Internet Archive ARC files
By default, these are compressed version 1 ARC files
The compression is done with gzip
Each record (which contains a document) is gzipped individually
All gzipped records are concatenated together to make up a file of multiple gzipped members
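To make the "file of multiple gzipped members" idea concrete, here is a minimal sketch using only Python's standard library. The record contents are made up, and `iter_gzip_members` is a hypothetical helper name; the point is only that independently gzipped chunks can be concatenated and later read back one member at a time.

```python
import gzip
import zlib

def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member found in data."""
    while data:
        d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
        chunk = d.decompress(data) + d.flush()
        yield chunk
        data = d.unused_data              # bytes that follow the finished member

# Three made-up records, each compressed on its own and then concatenated.
records = [b"record one\n", b"record two\n", b"record three\n"]
arc_like = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(arc_like))
print(len(members))
```

Because every member is a complete gzip stream, a reader can start decompressing at the offset of any single record without touching the rest of the file.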
Slide 11:ARC Record Example http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>Hello World!!!</HTML>
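The first line of the record above is a space-separated header. A minimal parsing sketch follows; the field names in `FIELDS` are an assumption based on the ten-field URL record layout in the Internet Archive ARC format description, and `parse_arc_header` is a hypothetical helper, not part of Heritrix or YouSeer.

```python
from datetime import datetime

# Assumed field order for the ten-field header shown in the example record.
FIELDS = [
    "url", "ip_address", "archive_date", "content_type", "result_code",
    "checksum", "location", "offset", "filename", "archive_length",
]

def parse_arc_header(line):
    """Split one ARC record header line into a dict of typed fields."""
    values = line.strip().split(" ")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # The archive date is a 14-digit timestamp: YYYYMMDDhhmmss.
    record["archive_date"] = datetime.strptime(record["archive_date"], "%Y%m%d%H%M%S")
    for key in ("result_code", "offset", "archive_length"):
        record[key] = int(record[key])
    return record

header = ("http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 "
          "text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202")
rec = parse_arc_header(header)
print(rec["url"], rec["content_type"], rec["archive_length"])
```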
Slide 12:Some Heritrix features Override crawling policy per domain
Include/Exclude specific media types, domains, file size.
Limit the number of threads, the connection bandwidth, and the number of writing threads
Obey or ignore Robots.txt
PLEASE OBEY ROBOTS.TXT
Slide 13:Apache Solr Very popular distribution of Lucene
Easy to configure and optimize
All modifications are in the XML files
No need to touch the code
The index has a schema, similar to database schema
Think of the index as a table in the database, and you have to define the columns
Slide 14:Solr Schema Example <field name="url" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="creationDate" type="date" indexed="true" stored="true"/>
<field name="rating" type="sint" indexed="true" stored="true"/>
<field name="published" type="boolean" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true" />
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>
Slide 15:Solr Documents Solr accepts well-formed XML documents
<add> <doc>
<field name="URL">www.cnn.com</field>
<field name="title">CNN Breaking News Obama wins</field>
<field name="content">Barack Obama is the 44th president of the USA</field>
<field name="pubDate">2008-11-06T23:59:59.999Z</field>
</doc> </add>
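Writing such documents by hand invites quoting and escaping mistakes, so it can help to generate them. The sketch below builds the same `<add><doc>` message with Python's standard XML library; `build_add_xml` is a hypothetical helper, and the field names are simply the ones used on the slide.

```python
import xml.etree.ElementTree as ET

def build_add_xml(docs):
    """Build a Solr <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # special characters are escaped on serialization
    return ET.tostring(add, encoding="unicode")

docs = [{
    "URL": "www.cnn.com",
    "title": "CNN Breaking News Obama wins",
    "content": "Barack Obama is the 44th president of the USA",
    "pubDate": "2008-11-06T23:59:59.999Z",
}]
xml_text = build_add_xml(docs)
print(xml_text)
```

The resulting string is what would be sent in the body of an HTTP POST to the Solr update URL.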
Slide 16:Solr Features Modify the ranking function by editing an XML file, or as part of your HTTP query request
Supports query suggestions, aka autocomplete
Provides spelling corrections from the terms in the index
Supports wildcard and range queries
Example: porch*, ferr*
Example: popularity:[10 TO *]
http://wiki.apache.org/solr/SolrQuerySyntax
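Wildcard and range queries contain characters such as `*`, `[` and `:` that must be URL-encoded when sent over HTTP. A minimal sketch, assuming a Solr instance whose base URL is `http://localhost:9003/solr` (a hypothetical team 3 port) and the standard `/select` request handler; `solr_select_url` is an illustrative helper name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def solr_select_url(base, query, rows=10):
    """Return a Solr /select URL with the query string safely encoded."""
    params = urlencode({"q": query, "rows": rows})
    return base.rstrip("/") + "/select?" + params

# Assumed base URL; substitute the port of your own installation.
base = "http://localhost:9003/solr"
range_url = solr_select_url(base, "popularity:[10 TO *]")
wildcard_url = solr_select_url(base, "porch*")
print(range_url)
print(wildcard_url)
```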
Slide 17:YouSeer workflow Waits for the crawled documents to be written
Iterates over the compressed files, and processes the documents
Extracts the textual content of each document, and parses its metadata
Generates an XML file as output
Each custom extractor appends its result to this file
This XML file is submitted to the index
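The "each extractor appends its result" step can be pictured as a chain of small functions that all write into the same XML document. This is a sketch of the idea only: the extractor names, the `record` dictionary layout, and the regex-based text extraction are assumptions for illustration and do not reflect YouSeer's actual extractor interface.

```python
import re
import xml.etree.ElementTree as ET

def url_extractor(record):
    return [("url", record["url"])]

def title_extractor(record):
    match = re.search(r"<title>(.*?)</title>", record["body"], re.I | re.S)
    return [("title", match.group(1).strip())] if match else []

def content_extractor(record):
    # Crude text extraction: drop tags, then collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", record["body"])
    return [("content", " ".join(text.split()))]

def record_to_doc(record, extractors):
    """Run every extractor in order; each appends its fields to the same <doc>."""
    doc = ET.Element("doc")
    for extractor in extractors:
        for name, value in extractor(record):
            ET.SubElement(doc, "field", name=name).text = value
    return doc

# Hypothetical crawled record.
record = {
    "url": "http://www.example.com/",
    "body": "<html><title>Hello</title><body>Hello World!!!</body></html>",
}
doc = record_to_doc(record, [url_extractor, title_extractor, content_extractor])
print(ET.tostring(doc, encoding="unicode"))
```

Adding support for a new kind of metadata then means writing one more extractor function and adding it to the list.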
Slide 18:Document formats supported
Slide 19:Demo: Configuration The schema of Solr is already configured in your installation
Solr is installed on tomcat
The Heritrix web interface listens on port 91XX, where XX is your team number
e.g. team 3: Heritrix listens on 9103
Slide 20:Demo The server is: ist441.ist.psu.edu
Heritrix port is blocked by the ITS
You need SSH tunneling
For Linux/Mac users:
ssh -L 91XX:localhost:91XX teamX@ist441.ist.psu.edu
Where XX is your team number
Navigate to localhost:91XX to access heritrix
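Since the port depends on the team number, the tunnel command can be generated rather than typed by hand. A minimal sketch, assuming the 91XX convention described above and a login name of the form `team` followed by the team number; `heritrix_port` and `tunnel_command` are hypothetical helper names.

```python
import shlex

def heritrix_port(team):
    """Return the Heritrix web UI port for a team: 91XX, XX = team number."""
    if not 0 <= team <= 99:
        raise ValueError("team number must be between 0 and 99")
    return 9100 + team

def tunnel_command(team, host="ist441.ist.psu.edu"):
    """Return the ssh argument list that forwards the team's Heritrix port."""
    port = heritrix_port(team)
    return ["ssh", "-L", f"{port}:localhost:{port}", f"team{team}@{host}"]

cmd = tunnel_command(3)
print(shlex.join(cmd))
```

Once the printed command is running in a terminal, the Heritrix interface is reachable in a local browser at `localhost` on the same port.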
Slide 21:Demo Tunneling for Windows users
Install putty
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Double click putty.exe
Enter the host name as ist441.ist.psu.edu
The connection type is SSH
Navigate to Connection->SSH->Tunnels
Source port is 91XX
Destination address is localhost:91XX
Press ADD
Press open
http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html
Slide 22:Demo: Putty http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html
Slide 23:Demo: Putty
Slide 24:Demo: Heritrix After connecting to the server, go to
~/crawler/heritrix-1.14.3/bin
Execute: ./heritrix --admin=teamX:Password
Now navigate to the Heritrix interface from your browser
Log in, and create your first job
http://crawler.archive.org/articles/user_manual/tutorial.html
Slide 25:Demo: Heritrix The most important parameter is the user agent, under configurations
Enter a valid URL and email address
Enter http://www.psu.edu
And your OWN email address
Do not run more than 5 threads
Do not crawl publishers' websites
Slide 26:Demo: Heritrix Heritrix login screen
Slide 27:Demo: Heritrix Enter the Seed URLs
Slide 28:Demo: Heritrix Change the Agent URL
Slide 29:Demo: Heritrix If everything goes well, after you submit you should see this screen
Slide 30:Demo ARC files are written to:
~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs
To start tomcat, enter start-tomcat
Solr will start automatically
YouSeer ingestion module is located under:
~/youseer/release
SubmitterConfig.xml
Provides the file types you want to index
Provides the database name
Provides mapping to the index schema
Slide 31:Demo To index documents crawled by heritrix:
Navigate to ~/youseer/release
Run: java -jar YouSeer.jar http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0
Solr URL
The full path to the ARC files
The virtual directory which maps to the cached files
Number of threads, please keep it at 1
Waiting Time between retries
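The five positional arguments are easy to mix up, so the sketch below assembles the command from named parameters. It assumes the Solr port follows the same team-number convention (90XX) and that the ARC path must be absolute, as the slide states; `youseer_command` is a hypothetical helper, and the example paths are placeholders.

```python
import posixpath
import shlex

def youseer_command(team, arc_dir, cache_dir, threads=1, wait=0):
    """Return the argument list for the YouSeer ingestion run."""
    if not posixpath.isabs(arc_dir):
        raise ValueError("the path to the ARC files must be absolute")
    solr_url = f"http://localhost:{9000 + team}/solr/update"  # assumed 90XX convention
    return [
        "java", "-jar", "YouSeer.jar",
        solr_url,       # Solr URL
        arc_dir,        # full path to the ARC files
        cache_dir,      # virtual directory that maps to the cached files
        str(threads),   # number of threads (keep it at 1)
        str(wait),      # waiting time between retries
    ]

# Placeholder paths for a hypothetical team 3 crawl job.
cmd = youseer_command(3, "/home/team3/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs", "/cachingDirectory")
print(shlex.join(cmd))
```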
Slide 32:Comments YouSeer records which ARC files have been processed in a database, named submitted.db by default
If you want to re-ingest the documents, remove the db file
Check the log file
The search interface:
http://ist441.ist.psu.edu:90XX/youseerui
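The processed-file tracking mentioned above explains why removing the db file triggers re-ingestion. The sketch below illustrates the general idea with SQLite; the actual format and schema of submitted.db are not described in these slides, so the table name, column, and helper functions here are purely hypothetical.

```python
import sqlite3

def open_tracker(path=":memory:"):
    """Open (or create) a tracking database; the schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS submitted (arc_file TEXT PRIMARY KEY)")
    return conn

def pending(conn, arc_files):
    """Return the ARC files that have not been recorded as processed."""
    done = {row[0] for row in conn.execute("SELECT arc_file FROM submitted")}
    return [f for f in arc_files if f not in done]

def mark_done(conn, arc_file):
    """Record an ARC file as processed."""
    with conn:
        conn.execute("INSERT OR IGNORE INTO submitted VALUES (?)", (arc_file,))

conn = open_tracker()                      # in-memory database for the example
arcs = ["IA-001.arc.gz", "IA-002.arc.gz"]  # made-up file names
mark_done(conn, "IA-001.arc.gz")
todo = pending(conn, arcs)
print(todo)

# A fresh, empty database (like a removed db file) makes every file pending again.
todo_after_reset = pending(open_tracker(), arcs)
```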
Slide 33:Q & A
Slide 34:References http://youseer.sourceforge.net/doc/Tutorial.pdf
http://crawler.archive.org/articles/user_manual/
https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf
http://www.packtpub.com/solr-1-4-enterprise-search-server
http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide?sc=AP