
Attempting to Use Wikipedia Categories to Improve Retrieval

INEX Linked Data Ad Hoc Track 2012. David Massey, ABI Kveik, 1st March 2013.


Presentation Transcript


  1. Attempting to Use Wikipedia Categories to Improve Retrieval • INEX Linked Data Ad Hoc Track 2012 • David Massey • ABI Kveik, 1st March 2013

  2. Task Description • 3.2M documents from English-language Wikipedia • 140 queries • Return a ranked list with 1000 documents for each query • Use Linked Data

  3. Document Collection • Approx. 30% of the files describe deleted files, images, etc. • XML-like documents (processed with regex) • Missing documents • Each document consists of three parts: • Wikipedia article • DBpedia properties • YAGO properties

  4. The_Deer_Hunter
     <lodxml xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xml:lang='en' xmlns:xhtml='http://www.w3.org/1999/xhtml' encoding='UTF-8'>
     <article title='73425'>
     <wikipedia>
     <paragraph>
     <template type='Metadata'> <arg></arg> <tag name='id'>73425</tag> <tag name='title'>The_Deer_Hunter</tag> </template>
     <template type='Otheruses'> <arg></arg> <arg>Deer Hunter (disambiguation)</arg> </template>
     <infobox type='film'>
     <tag name='name'>The Deer Hunter</tag>
     ...
     <tag name='director'> <link> <wikilink href='./f4/05/522346.xml'>Michael Cimino</wikilink> <dbpedia href='http://dbpedia.org/resource/Michael_Cimino'></dbpedia> <yago ref='Michael_Cimino'></yago> </link> </tag>
     <dbpediaproperties>
     <property name='http://dbpedia.org/ontology/thumbnail'> <object name='http://upload.wikimedia.org/wikipedia/commons/thumb/5/57/The_Deer_Hunter_poster.jpg/200px-The_Deer_Hunter_poster.jpg'></object> </property>
     ...
     <yagoproperties>
     <property name='hasDuration'> <object name='10920.0#s'></object> </property>
     <property name='isCalled'> <object name='A szarvasvad\u00e1sz'></object> </property>
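
The deck says the XML-like documents were handled with regular expressions but does not show the patterns. The following is a minimal sketch of that idea in Python, not the author's code. It assumes element and attribute names are separated by a space (for example `<dbpedia href='...'>`) and uses a shortened copy of the sample document; `extract_tags` and `extract_dbpedia_links` are hypothetical helper names.

```python
import re

# Shortened copy of the sample document from the slide (assumed spacing).
SAMPLE = (
    "<article title='73425'> <wikipedia> <paragraph> "
    "<template type='Metadata'> <arg></arg> "
    "<tag name='id'>73425</tag> "
    "<tag name='title'>The_Deer_Hunter</tag> </template> "
    "<tag name='director'> <link> "
    "<wikilink href='./f4/05/522346.xml'>Michael Cimino</wikilink> "
    "<dbpedia href='http://dbpedia.org/resource/Michael_Cimino'></dbpedia> "
    "<yago ref='Michael_Cimino'></yago> </link> </tag>"
)

# Matches <tag name='X'>plain text</tag>; tags with nested markup are skipped.
TAG_RE = re.compile(r"<tag name='(?P<name>[^']+)'>(?P<value>[^<]*)</tag>")
# Matches the DBpedia resource URI attached to a link.
DBPEDIA_RE = re.compile(r"<dbpedia href='([^']+)'>")


def extract_tags(document: str) -> dict:
    """Return a mapping of tag name to its plain-text value."""
    return {m.group("name"): m.group("value") for m in TAG_RE.finditer(document)}


def extract_dbpedia_links(document: str) -> list:
    """Return all DBpedia resource URIs referenced in the document."""
    return DBPEDIA_RE.findall(document)


if __name__ == "__main__":
    print(extract_tags(SAMPLE))
    print(extract_dbpedia_links(SAMPLE))
```

Regular expressions avoid failing on documents that are not well-formed XML, at the cost of missing values that contain nested markup, such as the `director` tag here.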

  5. Queries • guitar chord tuning • guitar chord minor • guitar classical flamenco • guitar classical bach • guitar origin Russia • guitar origin blues • tango culture movies • tango culture countries • tango music composers • tango music instruments • tango dance styles • tango dance history • vietnam war movie • vietnam war facts • vietnam food recipes • vietnamese food blog • vietnam travel national park • vietnam travel airports • bicycle sport races • bicycle sport disciplines • bicycle holiday nature • bicycle benefits health • bicycle benefits environment • female rock singers • south korean girl groups • electronic music genres • digital music notation formats • music conferences • intellectual property rights lobby

  6. Two-stage approach • Traditional retrieval • Improve by: • Using links between documents • Using categories • Using Linked Data

  7. Stage One • Extract title, headings and categories from documents • Index using Indri: Krovetz stemming, stopword list • Weighted search: Title (10), Category (5), H2 (2), H3 (1) • Smoothing (ask Michael)
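
The weighted search could be expressed with the Indri query language operators `#weight` and `#combine`, with each term restricted to a field via `term.(field)`. The sketch below builds such a query string in Python. The deck gives only the weights; the field names `title`, `category`, `h2` and `h3` and the function `build_indri_query` are assumptions for illustration, not the author's index schema.

```python
# Field weights from the slide; the field names themselves are assumed.
FIELD_WEIGHTS = {"title": 10, "category": 5, "h2": 2, "h3": 1}


def build_indri_query(query: str, field_weights: dict = FIELD_WEIGHTS) -> str:
    """Build an Indri #weight query that scores every term in every field."""
    terms = query.lower().split()
    parts = []
    for field, weight in field_weights.items():
        # term.(field) evaluates the term in the context of that field.
        restricted = " ".join(f"{term}.({field})" for term in terms)
        parts.append(f"{weight} #combine({restricted})")
    return "#weight(" + " ".join(parts) + ")"


if __name__ == "__main__":
    print(build_indri_query("vietnam war movie"))
```

With these weights a match in the title counts ten times as much as a match in an H3 heading.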

  8. Result After Stage One (Query: Vietnam War Movie) • Vietnam_War_Crimes_Working_Group • Vietnam_War_in_film • Operation_Sunrise_(Vietnam_War) • Vietnam_War_Story_II • Book:Vietnam_War • Vietnam_during_World_War_I • Vietnam_War_casualties • Vietnam_War_Crimes_Working_Group_Files • Puerto_Ricans_Missing_in_Action_in_the_Vietnam_War • Star_Wars_Mini_Movie_Awards • Vietnam_War_Memorial,_Hanoi • 17th_Parallel:_Vietnam_in_War • Vietnam:_The_Camera_At_War • March_Against_the_Vietnam_War • 1960_in_the_Vietnam_War • 1961_in_the_Vietnam_War • List_of_Vietnam_War_flying_aces • List_of_wars_involving_Vietnam • Outline_of_the_Vietnam_War • The_War_Within:_America's_Battle_over_Vietnam • Vietnam:_The_Ten_Thousand_Day_War • Protests_against_the_Vietnam_War • Matterhorn:_A_Novel_of_the_Vietnam_War • List_of_bombs_in_the_Vietnam_War • Puerto_Ricans_in_the_Vietnam_War • Military_history_of_Australia_during_the_Vietnam_War

  9. Stage Two • Links between documents • Categories • Linked Data • …

  10. Expand Query with WordNet Synonyms • Original query: vietnam war movie • vietnam -> annam • war -> warfare • movie -> film flick pic picture • Expanded query: vietnam annam war warfare film flick movie pic picture
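
The expansion step can be sketched as a lookup per query term. To keep the example self-contained, the hard-coded `SYNONYMS` table below stands in for a real WordNet lookup and contains only the synonyms shown on the slide; `expand_query` is a hypothetical helper, not the author's code.

```python
# Stand-in for WordNet: only the synonyms listed on the slide.
SYNONYMS = {
    "vietnam": ["annam"],
    "war": ["warfare"],
    "movie": ["film", "flick", "pic", "picture"],
}


def expand_query(query: str, synonyms: dict) -> list:
    """Return the query terms, each followed by its synonyms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for word in [term, *synonyms.get(term, [])]:
            if word not in expanded:
                expanded.append(word)
    return expanded


if __name__ == "__main__":
    print(" ".join(expand_query("vietnam war movie", SYNONYMS)))
```

The word order differs slightly from the slide, which lists the "movie" group alphabetically; the set of words is the same.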

  11. Calculate Text Similarity between Expanded Query and Category Name • Levenshtein distance: "The smallest number of insertions, deletions, and substitutions required to change one string or tree into another." NIST (http://xlinux.nist.gov/dads/HTML/Levenshtein.html) • Original query: Vietnam War Movie • Expanded query: Vietnam Annam War Warfare Film Flick Movie Pic Picture • Category: Vietnam War Films = (1 + (0*0 + 0*0 + 1*1)) / 3 = 2 / 3 = 0.66
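
The slide does not define the score formally. One reading that agrees with the worked numbers is: for each word in the category name, take the smallest Levenshtein distance to any word of the expanded query (ignoring case), square it, sum the squares, add 1, and divide by the number of category words. The Python sketch below implements that inferred reading; `category_score` is a hypothetical name.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (char_a != char_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def category_score(category: str, expanded_query: list) -> float:
    """Lower is more similar; formula inferred from the slide's examples."""
    query_words = [word.lower() for word in expanded_query]
    category_words = category.lower().split()
    squared = sum(
        min(levenshtein(word, q) for q in query_words) ** 2
        for word in category_words
    )
    return (1 + squared) / len(category_words)


if __name__ == "__main__":
    expanded = "vietnam annam war warfare film flick movie pic picture".split()
    for name in ["Vietnam War films", "War films", "Star Wars"]:
        print(name, category_score(name, expanded))
```

Squaring the distances penalises one badly matching word more than several near matches, and dividing by the word count keeps long category names comparable with short ones.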

  12. Categories ranked by similarity to expanded query (Query: Vietnam War Movie; Threshold < 1) • 0.66 Vietnam War films • 1 War films • 2.33 Star Wars films • 2.75 Star Wars fan films • 3 Fan films • 3 Punic Wars • 3 Star Wars • 3.66 Gulf War films • 3.75 World War I films • 5.5 Barbary Wars • 5.5 Boer Wars • 5.5 Civil wars • 5.5 Guild Wars • 5.5 Opium Wars • 5.66 Vietnam War books • 5.66 Vietnam War novels • 5.66 Vietnam War sites • 5.75 World War II media • 6.33 Flags of Vietnam • 6.33 Laws of war • 6.33 Media of Vietnam • 6.33 MTV Movie Awards • 6.8 Women in World War I • 7 Floods in Vietnam • 7 Songs of the Vietnam War • 7.33 Star Wars books • 7.33 Star Wars comics • 7.5 War crimes in Vietnam • 7.5 World War I games • 7.5 World War II comics
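
Applying the threshold can be sketched as a sort followed by a filter. The scores below are copied from the slide rather than recomputed, and `select_categories` is a hypothetical helper. How the selected categories then feed into the ranking is not described in the deck and is not modelled here.

```python
# A few (category, score) pairs copied from the slide.
SCORES = {
    "Star Wars": 3,
    "War films": 1,
    "Vietnam War films": 0.66,
    "Star Wars films": 2.33,
    "Gulf War films": 3.66,
}


def select_categories(scores: dict, threshold: float = 1.0) -> list:
    """Return category names with score strictly below the threshold, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [name for name, score in ranked if score < threshold]


if __name__ == "__main__":
    print(select_categories(SCORES))
```

Because the comparison is strict, "War films" with a score of exactly 1 is not selected.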

  13. Problems • Expanded query: Vietnam Annam War Warfare Film Flick Movie Pic Picture • Category: Star Wars = (1 + (2*2 + 1*1)) / 2 = 6 / 2 = 3 • Frosker = Forsker? (Norwegian: "frogs" vs. "researcher") • Homonyms – Vietnam War Picture • Missing categories

  14. Result After Stage Two (Query: Vietnam War Movie) • We_Were_Soldiers • A_Better_Tomorrow_3 • The_War_(film) • Faith_of_My_Fathers_(film) • Combat_Shock • The_Killing_Fields_(film) • A_Bright_Shining_Lie • Apocalypse_Now_Redux • Flight_of_the_Intruder • Dead_Presidents • R-Point • The_Last_Hunter • There_Is_No_13 • Deceit_(2009_film) • The_Ballad_of_Andy_Crocker • Some_Kind_of_Hero • The_Deer_Hunter • A_Rumor_of_War_(miniseries) • Platoon_(film) • The_Crazy_World_of_Julius_Vrooder • Thou_Shalt_Not_Kill..._Except • 1969_(film) • A.W.O.L._(2006_film) • The_Siege_of_Firebase_Gloria • Alamo_Bay • Rolling_Thunder_(film)

  15. Results per query
      Original Query                      Result Stage 1  Result Stage 2
      vietnam war movie                   '11-1------'    '11--------'
      vietnam war facts                   '11--11111-'    '11--11111-'
      vietnam food recipes                '------1---'    '------1---'
      vietnamese food blog                '----------'    '----------'
      vietnam travel national park        '1111111111'    '1111111111'
      vietnam travel airports             '-1-111----'    '-1-11-1---'
      guitar chord tuning                 '111111111-'    '111111111-'
      guitar chord minor                  '11111--11-'    '11111--11-'
      guitar classical flamenco           '----1-----'    '----1-----'
      guitar classical bach               '1-11--11--'    '1-11--11--'
      guitar origin Russia                '----------'    '----------'
      guitar origin blues                 '1-1-------'    '1-1-------'
      tango culture movies                '---1--1---'    '---1--1---'
      tango culture countries             '---1-1---1'    '---1-1---1'
      tango music composers               '-1--------'    '---1-1---1'
      tango music instruments             '----------'    '----------'
      tango dance styles                  '11--------'    '----------'
      tango dance history                 '111-------'    '111-------'
      bicycle sport races                 '111-1--1--'    '---1111-1-'
      bicycle sport disciplines           '----1-----'    '----1-----'
      bicycle holiday nature              '----------'    '----------'
      bicycle benefits health             '------1---'    '------1---'
      bicycle benefits environment        '---------1'    '---------1'
      female rock singers                 '1-------1-'    '1-1--1--1-'
      south korean girl groups            '----------'    '111111-111'
      electronic music genres             '1-1-1-----'    '-1-1------'
      digital music notation formats      '-111-1111-'    '-111-1111-'
      music conferences                   '11111-1---'    '--11111-1-'
      intellectual property rights lobby  '111-1-1111'    '111-1-1111'
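
The deck does not explain the ten-character strings. Assuming each character marks one of the top ten results, with `1` for a relevant document and `-` for a non-relevant one, precision at 10 is simply the share of `1` characters. The Python sketch below works under that assumption, using three rows copied from the table; `precision_at_k` is a hypothetical helper.

```python
def precision_at_k(marks: str) -> float:
    """Fraction of positions marked '1' (assumed to mean 'relevant')."""
    if not marks:
        return 0.0
    return marks.count("1") / len(marks)


# (stage 1, stage 2) strings copied from the table.
RESULTS = {
    "south korean girl groups": ("----------", "111111-111"),
    "tango dance styles": ("11--------", "----------"),
    "vietnam travel airports": ("-1-111----", "-1-11-1---"),
}

if __name__ == "__main__":
    for query, (stage1, stage2) in RESULTS.items():
        print(query, precision_at_k(stage1), precision_at_k(stage2))
```

Under this reading, stage two helps some queries, hurts others, and leaves some with the same number of relevant results in a different order.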

  16. Precision and Recall [Figure: precision (P) and recall (R)]

  17. Literature • Kaptein, R., Koolen, M., & Kamps, J. (2009, July). Using Wikipedia categories for ad hoc search. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 824-825). ACM. • Vercoustre, A. M., Pehcevski, J., & Thom, J. (2008). Using Wikipedia categories and links in entity ranking. Focused Access to XML Documents, 321-335. • Illustration: http://www.flickr.com/photos/pasukaru76/6196321318/
