Discover how text mining techniques can enhance visitor experiences in museums. This seminar explores the application of mobile technology to assist museum-goers by providing personalized exhibit recommendations based on their interests and movement patterns. Utilizing data from the Melbourne Museum, we analyze factors like physical proximity, exhibit popularity, and textual similarities to predict visitor preferences. Our findings reveal that while popularity-based predictions perform best, understanding visitor history is crucial for creating meaningful connections between exhibits. Join us in unlocking the stories of the past!
Personalisation Seminar on Unlocking the Secrets of the Past: Text Mining for Historical Documents
Sven Steudter
Domain • Museums offer a vast amount of information • But: visitors' receptivity and time are limited • Challenge: selecting (subjectively) interesting exhibits • Idea: a mobile, electronic handheld device, like a PDA, assists the visitor by: • Delivering content based on observations of the visit • Recommending exhibits • Non-intrusive, adaptive user-modelling technologies are used
Prediction stimuli • Different stimuli: • Physical proximity of exhibits • Conceptual similarity (based on the textual description of an exhibit) • Relative sequence in which other visitors visited the exhibits (popularity) • Evaluate the relative impact of the different factors => separate stimuli • Language-based models simulate the visitor's thought process
Experimental Setup • Melbourne Museum, Australia • Largest museum in the Southern Hemisphere • Restricted to the Australia Gallery collection, which presents the history of the city of Melbourne: • Phar Lap • CSIRAC • Variation of exhibits: they cannot be classified into a single category
Experimental Setup • Wide range of modalities: • Information plaques • Audio-visual enhancements • Multiple displays interacting with the visitor • Here: NO differentiation between exhibit types or modalities • The Australia Gallery collection consists of 53 exhibits • Topology of the floor: open-plan design => no sequence predetermined by the architecture
Resources • Floor plan of the exhibition, located on the 2nd floor • Physical distances between the exhibits • The Melbourne Museum website provides a corresponding web page for every exhibit • Dataset of 60 visitor paths through the gallery, used for: • Training (machine learning) • Evaluation
Predictions based on Proximity and Popularity • Proximity-based predictions: • Exhibits ranked in order of physical distance • Prediction: the closest not-yet-visited exhibit to the visitor's current location • In the evaluation: baseline • Popularity-based predictions: • Visitor paths provided by the Melbourne Museum • Convert paths into a matrix of transition probabilities • Zero probabilities removed with Laplacian smoothing • Markov model
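The popularity-based predictor can be sketched as a first-order Markov model over visitor paths. This is a minimal illustration, assuming exhibits are numbered 0..n-1 and paths are lists of exhibit numbers; the function names are hypothetical, not from the original system:

```python
from collections import defaultdict

def transition_matrix(paths, n_exhibits, alpha=1.0):
    """Estimate first-order Markov transition probabilities from visitor
    paths, with Laplace (add-alpha) smoothing so no transition is zero."""
    counts = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    probs = {}
    for a in range(n_exhibits):
        total = sum(counts[a].values()) + alpha * n_exhibits
        probs[a] = {b: (counts[a][b] + alpha) / total
                    for b in range(n_exhibits)}
    return probs

def predict_next(probs, current, visited):
    """Most probable not-yet-visited exhibit from the current one."""
    candidates = {b: p for b, p in probs[current].items() if b not in visited}
    return max(candidates, key=candidates.get)
```

With smoothing, every row of the matrix still sums to 1, but transitions never observed in the training paths keep a small non-zero probability.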
Text-based Prediction • Exhibits related to each other by information content • Every exhibit's web page consists of: • A body of text describing the exhibit • A set of attribute keywords • Prediction of the most similar exhibit: • Keywords as queries • Web pages as the document space • Simple term frequency-inverse document frequency, tf-idf • The score of each query over each document is normalised
WSD • Why do visitors make connections between exhibits? • Multiple similarities between exhibits are possible • Use of Word Sense Disambiguation: • The path of a visitor as a sentence of exhibits • Each exhibit in the sentence has an associated meaning • Determine the meaning of the next exhibit • For each word in the keyword set of each exhibit: • WordNet similarity is calculated against each other word in the other exhibits
WordNet Similarity • Similarity methods used: • Lin (measures the difference in information content of two terms as a function of their probability of occurrence in a corpus) • Leacock-Chodorow (edge counting: a function of the length of the path linking the terms and the position of the terms in the taxonomy) • Banerjee-Pedersen (Lesk algorithm) • Similarity as the sum of WordNet similarities between each keyword pair • The visitor's history may be important for prediction • The latest visited exhibits have a higher impact on the visitor than the first visited exhibits
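To illustrate the edge-counting idea behind Leacock-Chodorow, here is a minimal Python sketch over a tiny hand-made 'is-a' taxonomy. The taxonomy and its node names are invented for the example; the real system queries WordNet:

```python
import math

# Toy 'is-a' taxonomy: child -> parent (root has parent None).
TAXONOMY = {
    "entity": None,
    "animal": "entity", "artifact": "entity",
    "horse": "animal", "dog": "animal",
    "computer": "artifact", "vehicle": "artifact",
}

def depth(node):
    """Number of nodes from this node up to the root, inclusive."""
    d = 1
    while TAXONOMY[node] is not None:
        node = TAXONOMY[node]
        d += 1
    return d

def path_length(a, b):
    """Number of nodes on the shortest is-a path joining a and b."""
    ancestors_a = {}
    n, d = a, 0
    while n is not None:
        ancestors_a[n] = d
        n, d = TAXONOMY[n], d + 1
    n, d = b, 0
    while n is not None:
        if n in ancestors_a:  # lowest common ancestor found
            return ancestors_a[n] + d + 1
        n, d = TAXONOMY[n], d + 1

def lch_similarity(a, b):
    """Leacock-Chodorow: -log(path / (2 * max taxonomy depth))."""
    max_depth = max(depth(n) for n in TAXONOMY)
    return -math.log(path_length(a, b) / (2.0 * max_depth))
```

Terms joined by a short path high up in the taxonomy score higher; identical terms score highest of all.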
Evaluation: Method • For each method, two tests: • Predict the next exhibit in the visitor's path • Restrict predictions: predict only if the prediction is over a threshold • Evaluation data: the aforementioned 60 visitor paths • 60-fold cross-validation used; for Popularity: • 59 visitor paths as training data • 1 remaining path used for evaluation • Repeat this for all 60 paths • Combine the results into a single estimation (e.g. average)
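With 60 paths and 60 folds, this is leave-one-out cross-validation. A minimal sketch, where `train_fn` and `eval_fn` are placeholders for whichever model (e.g. the popularity Markov model) and metric are being evaluated:

```python
def leave_one_out(paths, train_fn, eval_fn):
    """Leave-one-out cross-validation: train on all paths but one,
    evaluate on the held-out path, and average the fold scores."""
    scores = []
    for i, held_out in enumerate(paths):
        training = paths[:i] + paths[i + 1:]
        model = train_fn(training)
        scores.append(eval_fn(model, held_out))
    return sum(scores) / len(scores)
```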
Evaluation • Accuracy: percentage of times the event that occurred was predicted with the highest probability • BOE (Bag of Exhibits): percentage of exhibits visited by the visitor, not necessarily in the order of recommendation • BOE is, in this case, identical to precision • [Chart: results for single exhibit history]
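The two metrics can be sketched as follows; this assumes `predictions[i]` is the exhibit predicted after the visitor's i-th stop, which is an illustrative simplification of the evaluation setup:

```python
def accuracy(predictions, actual_path):
    """Fraction of steps where the top-ranked prediction matched the
    exhibit the visitor actually went to next (order-sensitive)."""
    steps = actual_path[1:]
    hits = sum(1 for pred, actual in zip(predictions, steps)
               if pred == actual)
    return hits / len(steps)

def bag_of_exhibits(predictions, actual_path):
    """Order-insensitive score: fraction of recommended exhibits that
    were visited at some point (equal to precision in this setting)."""
    visited = set(actual_path)
    return sum(1 for p in predictions if p in visited) / len(predictions)
```

A prediction that names the right exhibit one step too early hurts accuracy but not BOE, which is exactly the distinction the two measures are meant to capture.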
Evaluation • Single exhibit history • [Charts: results without threshold and with threshold]
Evaluation • [Charts: visitor-history-enhanced results vs. single exhibit history]
Conclusion • Best-performing method: popularity-based prediction • History-enhanced models were low performers; a possible reason: • Visitors had no preconceived task in mind • They moved from one impressive exhibit to the next • History is not relevant here; the current location is more important • Keep in mind: • Small dataset • The Melbourne Gallery (history of the city) is perhaps not a good choice
tf-idf • Term frequency - inverse document frequency • Term count = number of times a given term appears in a document: the number n_{i,j} of occurrences of term t_i in document d_j • In longer documents a term is more likely to occur, therefore normalise: tf_{i,j} = n_{i,j} / Σ_k n_{k,j} • Inverse document frequency, idf, measures the general importance of a term: idf_i = log(N / |{d : t_i ∈ d}|), the log of the total number of documents divided by the number of documents containing the term
tf-idf: Similarity • Vector space model used • Documents and queries represented as vectors • Each dimension corresponds to a term • tf-idf used for weighting • Compare the angle between query and document (cosine similarity)
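The tf-idf weighting and angle comparison can be sketched as a toy implementation, assuming documents are already tokenised into lists of terms (the real system additionally normalises each query's scores across documents):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors for a list of tokenised documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * math.log(n / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    if norm(u) == 0.0 or norm(v) == 0.0:
        return 0.0
    return dot / (norm(u) * norm(v))
```

Documents sharing rare terms end up close in the vector space, while documents with no terms in common are orthogonal (cosine 0).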
WordNet similarities • Lin: • A method to compute the semantic relatedness of word senses using the information content of the concepts in WordNet and the 'Similarity Theorem' • Leacock-Chodorow: • Counts the number of edges between the senses in the 'is-a' hierarchy of WordNet • The value is then scaled by the maximum depth of the WordNet 'is-a' hierarchy • Banerjee-Pedersen, Lesk: • Chooses pairs of ambiguous words within a neighbourhood • Checks their definitions in a dictionary • Chooses the senses so as to maximise the number of common terms in the definitions of the chosen words
Precision, Recall • Precision: percentage of retrieved documents that are relevant, with respect to the total number of documents retrieved • Recall: percentage of relevant documents retrieved, with respect to the total number of relevant documents in the data space
F-Score • The F-score combines precision and recall • Harmonic mean of precision and recall: F = 2 · (precision · recall) / (precision + recall)
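The three measures together, as a small Python sketch over sets of retrieved and relevant items:

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F-score for a set of retrieved items
    against the set of truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Harmonic mean punishes imbalance between the two measures.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the harmonic mean is dominated by the smaller value, a system cannot achieve a high F-score by maximising only one of the two measures.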