Introduction to YouSeer

Slide 1:Introduction to YouSeer Madian Khabsa

Slide 2:Outline Overview YouSeer components Heritrix Solr Demo

Slide 3:Overview YouSeer is a complete and powerful open source search engine, available on SourceForge, that integrates the open source crawler Heritrix with the open source indexer Solr/Lucene. It is Java-based and reported to run successfully on Windows (needs confirmation)

Slide 4:Search Engine: Basic Workflow

Slide 5:Advantages of YouSeer Built on top of scalable components Tested on 23M documents, while Solr and Heritrix can scale to billions Very flexible, and easy to extend Modifying the index and the ingestion module is easy The crawler supports complicated crawling policies

Slide 6:YouSeer Components Heritrix: The Internet Archive's crawler Reported to scale up to 1B documents Written in Java, and has a web interface Apache Solr: open source enterprise search server based on Lucene Has a REST-like API Supports caching, distributed search, and index replication

Slide 7:YouSeer Architecture

Slide 8:Heritrix Workflow 1) Choose a URI from among all those scheduled 2) Fetch that URI 3) Analyze or archive the results 4) Select discovered URIs of interest, and add them to those scheduled 5) Note that the URI is done and repeat "An Introduction to Heritrix: An open source archival quality web crawler". Gordon Mohr et al.
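
The five steps can be sketched as a small loop over a queue of scheduled URIs. This is only an illustration of the workflow, not Heritrix code: the function names (`crawl`, `fetch`, `extract_links`, `in_scope`) and the in-memory `PAGES` table are made up for the example, and the "fetch" reads from that table instead of the network.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, in_scope, max_uris=100):
    """Run the five-step loop over a frontier of scheduled URIs."""
    frontier = deque(seeds)   # URIs scheduled for fetching
    seen = set(seeds)         # every URI that was ever scheduled
    done = []                 # URIs that have been processed
    while frontier and len(done) < max_uris:
        uri = frontier.popleft()              # 1) choose a scheduled URI
        content = fetch(uri)                  # 2) fetch that URI
        links = extract_links(uri, content)   # 3) analyze the result
        for link in links:                    # 4) schedule URIs of interest
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
        done.append(uri)                      # 5) note the URI is done, repeat
    return done


# Hypothetical in-memory "web": each page maps to the links it contains.
PAGES = {
    "a": ["b", "c"],
    "b": ["c", "x"],
    "c": ["a"],
}

order = crawl(
    ["a"],
    fetch=lambda uri: PAGES.get(uri, []),
    extract_links=lambda uri, content: content,
    in_scope=lambda uri: uri in PAGES,
)
print(order)
```

The `seen` set is what keeps the loop from revisiting a page, and `in_scope` stands in for the crawling policy that decides which discovered URIs are "of interest".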

Slide 9:Heritrix Architecture

Slide 10:Heritrix Crawl Result By default, Heritrix writes everything it crawls to disk as compressed Internet Archive ARC files (version 1) The compression is done with gzip Each record (which contains a document) is gzipped separately All gzipped records are concatenated to make up a file of multiple gzipped members
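
Because every record is its own gzip member, a reader can walk such a file one member at a time. The sketch below shows the idea with Python's standard `zlib` module on a small blob built in memory; the helper name `iter_gzip_members` is made up for the example, and it does not parse real ARC records.

```python
import gzip
import zlib


def iter_gzip_members(data):
    """Yield the decompressed bytes of each gzip member found in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payload = decomp.decompress(data)
        payload += decomp.flush()
        yield payload
        # Bytes left after the end of this member belong to the next one.
        data = decomp.unused_data


# Build a file-like blob of three independently gzipped records.
records = [b"record one", b"record two", b"record three"]
blob = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(blob))
print(members)
```

Compressing each record separately is what makes it possible to seek to a record's offset and decompress only that record.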

Slide 11:ARC Record Example
http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30
<HTML>Hello World!!!</HTML>
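
The first line of the record is a space-separated header describing the captured document. A minimal parsing sketch follows. The field names are an assumption: the slide does not name them, and they are taken here from the ten-field URL record layout in the Internet Archive's ARC format description, so check them against that document before relying on them.

```python
from datetime import datetime

# Assumed names for the ten space-separated header fields (not given on the slide).
FIELDS = [
    "url", "ip_address", "archive_date", "content_type", "result_code",
    "checksum", "location", "offset", "filename", "archive_length",
]


def parse_arc_header(line):
    """Split one ARC record header line into a dict of typed fields."""
    values = line.strip().split(" ")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    # The date is a 14-digit timestamp: YYYYMMDDhhmmss.
    record["archive_date"] = datetime.strptime(record["archive_date"], "%Y%m%d%H%M%S")
    for key in ("result_code", "offset", "archive_length"):
        record[key] = int(record[key])
    return record


header = (
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 "
    "text/html 200 fac069150613fe55599cc7fa88aa089d - 209 IA-001102.arc 202"
)
rec = parse_arc_header(header)
print(rec["url"], rec["content_type"], rec["archive_length"])
```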

Slide 12:Some Heritrix features Override crawling policy per domain Include/exclude specific media types, domains, file sizes Limit the number of threads, connection bandwidth, writing threads Obey or ignore robots.txt PLEASE OBEY ROBOTS.TXT

Slide 13:Apache Solr Very popular distribution of Lucene Easy to configure and optimize All modifications are in the XML files No need to touch the code The index has a schema, similar to database schema Think of the index as a table in the database, and you have to define the columns

Slide 14:Solr Schema Example
<field name="url" type="string" indexed="true" stored="true"/>
<field name="title" type="text" indexed="true" stored="true"/>
<field name="keywords" type="text_ws" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<field name="creationDate" type="date" indexed="true" stored="true"/>
<field name="rating" type="sint" indexed="true" stored="true"/>
<field name="published" type="boolean" indexed="true" stored="true"/>
<field name="content" type="text" indexed="true" stored="true"/>
<field name="all" type="text" indexed="true" stored="true" multiValued="true"/>

Slide 15:Solr Documents Solr accepts well-formed XML documents
<add>
  <doc>
    <field name="URL">www.cnn.com</field>
    <field name="title">CNN Breaking News - Obama wins</field>
    <field name="content">Barack Obama is the 44th president of the USA</field>
    <field name="pubDate">2008-11-06T23:59:59.999Z</field>
  </doc>
</add>
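
An `<add>` document like this one can be generated rather than written by hand, which also takes care of escaping characters such as `&`. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `build_add_xml` and the sample values are made up for the example, and it only builds the XML string without sending it to Solr.

```python
import xml.etree.ElementTree as ET


def build_add_xml(fields):
    """Build a Solr <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on serialization
    return ET.tostring(add, encoding="unicode")


xml_text = build_add_xml([
    ("URL", "www.example.com"),
    ("title", "News & weather"),
])
print(xml_text)
```

The field names must match fields defined in the index schema, otherwise Solr will reject the document.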

Slide 16:Solr Features Modify the ranking function by editing an XML file, or along with your HTTP query request Supports query suggestions, aka autocomplete Provides spelling corrections from the terms in the index Supports wildcard queries (example: porch*, ferr*) and range queries (example: popularity:[10 TO *]) http://wiki.apache.org/solr/SolrQuerySyntax
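
Queries such as the range example contain spaces, brackets and colons, so they have to be URL-encoded before they are sent over HTTP. A minimal sketch, assuming a Solr instance at the hypothetical base URL `http://localhost:8983/solr` and the standard `/select` handler with its `q` and `rows` parameters; the helper name `select_url` is made up for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base, query, rows=10):
    """Return a Solr select URL with the query string safely encoded."""
    params = {"q": query, "rows": rows}
    return base.rstrip("/") + "/select?" + urlencode(params)


# The base URL is an assumption; substitute the one for your installation.
url = select_url("http://localhost:8983/solr", "popularity:[10 TO *]")
print(url)

# Decoding the query string gives back the original query text.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```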

Slide 17:YouSeer workflow Waits for the crawled documents to be written Iterates over the compressed files and processes the documents Extracts the textual content of each document and parses its metadata Generates an XML file as output Each custom extractor appends its result to this file This XML file is submitted to the index
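
The "each custom extractor appends its result" step can be pictured as a chain of functions that each contribute fields for one document. This is a conceptual sketch only, not YouSeer's actual extractor interface (YouSeer itself is Java-based); every name in it is hypothetical.

```python
def extract_title(text):
    """Hypothetical extractor: use the first non-empty line as the title."""
    stripped = text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    return [("title", first_line)]


def extract_content(text):
    """Hypothetical extractor: collapse whitespace into a single content field."""
    return [("content", " ".join(text.split()))]


def run_extractors(url, text, extractors):
    """Collect the fields for one document, letting each extractor append its own."""
    fields = [("url", url)]
    for extractor in extractors:
        fields.extend(extractor(text))
    return fields


fields = run_extractors(
    "http://www.psu.edu",
    "Penn State\n\nWelcome  to PSU",
    [extract_title, extract_content],
)
print(fields)
```

Adding a new kind of metadata then means adding one more function to the list, without touching the others.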

Slide 18:Document formats supported

Slide 19:Demo: Configuration The Solr schema is already configured in your installation Solr is installed on Tomcat The Heritrix web interface listens on port 91XX, where XX is your team number, e.g. for team 3, Heritrix listens on 9103

Slide 20:Demo The server is: ist441.ist.psu.edu The Heritrix port is blocked by ITS You need SSH tunneling For Linux/Mac users: ssh -L 91XX:localhost:91XX teamX@ist441 where XX is your team number Navigate to localhost:91XX to access Heritrix
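
The port rule (91 followed by the two-digit team number) and the tunnel command can be derived from the team number. The sketch below only builds the command as an argument list and does not run it. The helper names are made up, and the login format `team<number>` with an unpadded number is an assumption based on the `teamX` placeholder above.

```python
def heritrix_port(team):
    """Return the Heritrix web interface port for a team: 91XX, XX = team number."""
    if not 0 <= team <= 99:
        raise ValueError("team number must fit in two digits")
    return 9100 + team


def tunnel_command(team, host="ist441.ist.psu.edu"):
    """Build the ssh port-forwarding command as an argument list."""
    port = heritrix_port(team)
    # Assumption: the login name is "team" followed by the unpadded team number.
    return ["ssh", "-L", f"{port}:localhost:{port}", f"team{team}@{host}"]


print(heritrix_port(3))
print(" ".join(tunnel_command(3)))
```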

Slide 21:Demo Tunneling for Windows users Install PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html Double-click putty.exe Enter the host name as ist441.ist.psu.edu The connection type is SSH Navigate to Connection->SSH->Tunnels Source port is 91XX Destination address is localhost:91XX Press Add Press Open http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html

Slide 22:Demo: Putty http://www.ytechie.com/2008/05/set-up-a-windows-ssh-tunnel-in-10-minutes-or-less.html

Slide 23:Demo: Putty

Slide 24:Demo: Heritrix After connecting to the server, go to ~/crawler/heritrix-1.14.3/bin Execute: ./heritrix --admin=teamX:Password Now navigate to the Heritrix interface from your browser Log in, and create your first job http://crawler.archive.org/articles/user_manual/tutorial.html

Slide 25:Demo: Heritrix The most important parameter is the user agent, under the configuration settings Enter a valid URL and email address: use http://www.psu.edu and your OWN email address Do not run more than 5 threads Do not crawl publishers' websites

Slide 26:Demo: Heritrix Heritrix log in screen

Slide 27:Demo: Heritrix Enter the Seed URLs

Slide 28:Demo: Heritrix Change the Agent URL

Slide 29:Demo: Heritrix If everything goes well, after you submit you should see this screen

Slide 30:Demo ARC files are written to: ~/crawler/heritrix-1.14.3/jobs/JOB-NAME/arcs To start Tomcat, enter start-tomcat Solr will start automatically The YouSeer ingestion module is located under: ~/youseer/release SubmitterConfig.xml provides the file types you want to index, the database name, and the mapping to the index schema

Slide 31:Demo To index documents crawled by Heritrix: Navigate to ~/youseer/release Run: java -jar YouSeer.jar http://localhost:90XX/solr/update /absolute/path/to/arc/files /cachingDirectory 1 0 The arguments are, in order: the Solr URL; the full path to the ARC files; the virtual directory which maps to the cached files; the number of threads (please keep it at 1); the waiting time between retries

Slide 32:Comments YouSeer tracks which ARC files have been processed in a database, whose default name is submitted.db If you want to re-ingest the documents, remove the db file Check the log file The search interface: http://ist441.ist.psu.edu:90XX/youseerui

Slide 33:Q & A

Slide 34:References http://youseer.sourceforge.net/doc/Tutorial.pdf http://crawler.archive.org/articles/user_manual/ https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf http://www.packtpub.com/solr-1-4-enterprise-search-server?utm_source=http%3A%2F%2Flucene.apache.org%2Fsolr%2F&utm_medium=spons&utm_content=pod&utm_campaign=mdb_000275 http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr/Reference-Guide?sc=AP
