An Architecture for Emergent Semantics. Sven Herschel, Ralf Heese, and Jens Bleiholder, Humboldt-Universität zu Berlin/Hasso-Plattner-Institut. Ideas of Emergent Semantics: improve document representation by aggregating many users’ opinions.
An Architecture for Emergent Semantics
Sven Herschel, Ralf Heese, and Jens Bleiholder
Humboldt-Universität zu Berlin/Hasso-Plattner-Institut
Ideas of Emergent Semantics
• Improve document representation by aggregating many users’ opinions
• Adding keywords implicitly while querying the corpus
• Living document representation instead of query reformulation
• Entirely new keywords
• Immediate change of the document representation and of the corpus index
[Figure: Information Retrieval today, with the labels user, query, IR query engine, and corpus/document representation]
S. Herschel, R. Heese, and J. Bleiholder: Emergent Semantics
Outline
• Basement (Background)
• Construction (Architecture of Emergent Semantics)
• Assessment (Evaluation)
• Roof and Windows (Conclusion and Future Work)
Information Retrieval
• Information Retrieval
  • Content-oriented search on a set of documents
  • Find a document representation that retrieves documents effectively and efficiently according to the user’s query
• Today's approaches
  • Capture the semantics of a document by analyzing syntactic information
  • No new words in document representation
  • Synonyms cannot be added
  • Query refinement
Basement | Construction | Assessment | Roof and Windows
Semiotic
[Figure: signs, represented object, and user interpretation, contrasting current IR approaches with emergent semantics]
• Syntax: t r e e
• Semantics: A tall perennial woody plant …
• Pragmatics: A figure that branches from a single root …
Source: http://www.wordreference.com/definition/tree
Components of Emergent Semantics
[Figure: architecture diagram with a query (?) and a result (!), the corpus/document representation (t1 … tn), and knowledge; steps are numbered 1 to 4]
• Interpreter
• Query Engine: Retrieval Engine and Ranking Function
• Annotation Filter
• Quality Measure
Bootstrapping
[Figure: corpus/document representation (t1 … tn) and knowledge]
• Index the document corpus, e.g., with TF/IDF or Latent Semantic Indexing
Receiving a Query
• Step 1 (Interpreter): reformulate the query, e.g., by query expansion or by replacing terms
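The reformulation step could be sketched as follows in Python. This is only an illustration of query expansion under simple assumptions: the synonym table `SYNONYMS` and the function `expand_query` are hypothetical and do not come from the slides, which leave the interpreter's knowledge source open.

```python
# Hypothetical synonym table; a real interpreter would consult a knowledge source.
SYNONYMS = {
    "rdbms": ["database", "dbms"],
    "language": ["syntax"],
}

def expand_query(terms, synonyms=SYNONYMS):
    """Return the lower-cased query terms, each followed by its synonyms, without duplicates."""
    expanded = []
    for term in terms:
        key = term.lower()
        for candidate in [key] + synonyms.get(key, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

expanded = expand_query(["RDBMS", "SQL", "language"])
print(expanded)  # ['rdbms', 'database', 'dbms', 'sql', 'language', 'syntax']
```

Replacing terms instead of expanding them would only need the inner loop to emit the synonyms without the original term.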
Query Evaluation
• Step 2 (Query Engine)
  • Retrieval Engine: select documents according to the query, e.g., with an inverted index of all terms
  • Ranking Function: rank the list of matching documents, e.g., with the vector space model
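A minimal Python sketch of these two sub-steps, assuming a toy corpus and raw term counts as vector components (the slides use TF/IDF weights). The names `build_inverted_index`, `retrieve`, `cosine`, and `rank` are illustrative choices, not names from the presented system.

```python
import math
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def retrieve(index, query):
    """Retrieval engine: every document containing at least one query term."""
    matches = set()
    for term in query:
        matches |= index.get(term, set())
    return matches

def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank(docs, candidates, query):
    """Ranking function: order candidates by cosine similarity to the query."""
    q = Counter(query)
    scored = [(cosine(q, Counter(docs[d])), d) for d in candidates]
    return [d for score, d in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

docs = {  # toy corpus (assumption)
    "d1": ["sql", "language", "query"],
    "d2": ["sql", "index"],
    "d3": ["compiler", "parser"],
}
index = build_inverted_index(docs)
candidates = retrieve(index, ["sql", "language"])
print(rank(docs, candidates, ["sql", "language"]))  # ['d1', 'd2']
```

Separating selection from ranking mirrors the split into Retrieval Engine and Ranking Function: the index narrows the candidate set cheaply, and only the candidates are scored.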
Query Result
• Step 3: the user determines the set of relevant documents by evaluating the document surrogates.
Feedback
• The user retrieves the relevant documents.
• Idea: if a document is found by the query terms and is marked as relevant, then all query terms are related to the document.
• Add the original query to the document representation.
• Components involved: Annotation Filter and Quality Measure
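The feedback idea can be illustrated with a short Python sketch. It assumes that a document representation is a bag of terms (`collections.Counter`) and that no filtering is applied; `apply_feedback` and the sample data are hypothetical, and the Annotation Filter and Quality Measure are deliberately left out.

```python
from collections import Counter

def apply_feedback(doc_repr, query_terms, relevant_ids):
    """Add every query term to the representation of each relevant document."""
    for doc_id in relevant_ids:
        doc_repr[doc_id].update(query_terms)
    return doc_repr

doc_repr = {  # toy representations (assumption)
    "d1": Counter({"sql": 2, "language": 1}),
    "d2": Counter({"sql": 1}),
    "d5": Counter({"language": 3}),
}
# The user marked d1 and d2 as relevant; d5 was retrieved but not marked.
apply_feedback(doc_repr, ["rdbms", "sql", "language"], ["d1", "d2"])
print(doc_repr["d1"]["rdbms"], doc_repr["d5"]["rdbms"])  # 1 0
```

After the call, "rdbms" is part of the representations of d1 and d2 even though it never occurred in their text, while d5 is unchanged.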
Emergent Semantics Architecture
[Figure: complete architecture with Interpreter, Query Engine (Retrieval Engine, Ranking Function), Annotation Filter, and Quality Measure around the corpus/document representation and knowledge; steps are numbered 1 to 4]
• Levels shown: Syntax, Pragmatics, Semantics
• Questions shown: What do I mean by my query? How do most users formulate this query? How is the corpus queried?
Example – Querying the document corpus
• TF/IDF matrix of the document corpus
  • TF/IDF: weight = (term freq ∙ #doc) / doc freq
• RDBMS does not occur in the document corpus
• Query: Q = {RDBMS, SQL, language}
• Ranked result: D_Query = (d1, d5, d2, d10)
• Relevant documents: D_relevant = {d1, d2}
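The weight formula on this slide can be computed as in the following Python sketch. The corpus is a made-up toy example, not the matrix shown in the talk, and `tfidf_matrix` is an illustrative name. Note that the slide's formula uses the plain ratio #doc / doc freq, so no logarithm is applied here.

```python
from collections import Counter

def tfidf_matrix(docs):
    """Weight each term as (term freq * number of docs) / doc freq, as on the slide."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document
    return {
        doc_id: {t: tf * n_docs / doc_freq[t] for t, tf in Counter(terms).items()}
        for doc_id, terms in docs.items()
    }

docs = {  # toy corpus (assumption)
    "d1": ["sql", "sql", "language"],
    "d2": ["sql", "index"],
    "d3": ["language", "compiler"],
    "d4": ["index"],
}
weights = tfidf_matrix(docs)
print(weights["d1"])  # {'sql': 4.0, 'language': 2.0}
```

A term that never occurs in the corpus, such as "rdbms" here, has no entry in any row, which is why a query for it cannot match before feedback.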
Example – Adding the query terms
• Adding {RDBMS, SQL, language} to document representation (Annotation Filter)
• Recalculation of the TF/IDF matrix necessary
  • Recalculation for keyword: language
  • Recalculation for keyword: SQL
  • Recalculation for keyword: RDBMS
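Why the whole matrix needs recalculating can be seen in a small Python sketch. It assumes the weight formula weight = (term freq ∙ #doc) / doc freq and a toy corpus in which d1 and d2 stand in for the relevant documents; the function name `tfidf` and the data are illustrative only.

```python
from collections import Counter

def tfidf(docs):
    """Weight = (term freq * number of docs) / doc freq."""
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    return {
        d: {t: tf * len(docs) / doc_freq[t] for t, tf in Counter(terms).items()}
        for d, terms in docs.items()
    }

docs = {  # toy corpus (assumption); d1 and d2 play the relevant documents
    "d1": ["sql", "language"],
    "d2": ["sql"],
    "d3": ["language", "compiler"],
}
before = tfidf(docs)

query = ["rdbms", "sql", "language"]
for doc_id in ("d1", "d2"):  # feedback: attach the query to the relevant documents
    docs[doc_id] = docs[doc_id] + query
after = tfidf(docs)

print(before["d3"]["language"], after["d3"]["language"])  # 1.5 1.0
print(after["d1"]["rdbms"])                               # 1.5
```

Document d3 received no feedback, yet its weight for "language" changes because the document frequency of that term grew. This is the reason a recalculation per keyword touches more than the annotated documents.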
Living Document Representation
• Document representations change over time (living document representation)
• Many similar queries → weights of the query terms increase
• Unrelated query terms → document representation changes only slightly
• New keywords / semantic concepts in document representation
[Figure: queries feeding into document representations]
Experiment I - Setup
• CACM corpus
  • 3200 documents + 32 queries + gold standard
  • Title and abstract tokenized and indexed using Apache Lucene
• Retrieval and Ranking
  • Vector space model with TF/IDF weights
• Feedback
  • Attach the tokenized query to all relevant document representations
Exploit corpus correlations
• Split the set of queries into halves
• Run first half and feed back all query terms
• Run second half
[Figure: experiment flow, labels in order of appearance]
• Run query set 1 (identical to TF/IDF without EmSem)
• Small overlap between queries; small overlap between result sets
• Add query terms to relevant document representation
• Run query set 2
• Run query set 1, measure again (1st EmSem run)
• Run query set 2
• Add query terms again …
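The split-half protocol can be sketched in Python as below. Everything here is an assumption made for illustration: the toy corpus, the term-overlap `search`, and the use of mean recall as the measure (the slides do not name the measure). The gold standard plays the role of the user marking relevant documents.

```python
def search(docs, query):
    """Return ids of documents sharing a term with the query, best overlap first."""
    scored = [(len(terms & set(query)), d) for d, terms in docs.items()]
    return [d for score, d in sorted(scored, key=lambda pair: (-pair[0], pair[1])) if score > 0]

def mean_recall(docs, queries, gold):
    """Average fraction of gold-standard documents found per query."""
    recalls = [
        len(set(search(docs, q)) & gold[name]) / len(gold[name])
        for name, q in queries.items()
    ]
    return sum(recalls) / len(recalls)

def feed_back(docs, queries, gold):
    """Attach each query's terms to its relevant documents."""
    for name, q in queries.items():
        for d in gold[name]:
            docs[d] |= set(q)

docs = {"d1": {"relational", "storage"}, "d2": {"parser"}, "d3": {"storage"}}
set1 = {"q1": ["rdbms", "relational"]}   # first half of the queries
set2 = {"q2": ["rdbms"]}                 # second half of the queries
gold = {"q1": {"d1"}, "q2": {"d1"}}

baseline = mean_recall(docs, set2, gold)  # set 2 before any feedback
feed_back(docs, set1, gold)               # run set 1 and feed its terms back
after = mean_recall(docs, set2, gold)     # set 2 on the updated representations
print(baseline, after)  # 0.0 1.0
```

The improvement in this toy case exists only because q1 and q2 share the term "rdbms" and the relevant document d1. With little overlap between the two query halves, the second half would see correspondingly little benefit.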
Feeding back all query terms
• Run all queries and feed back all query terms
Experiment II - Setup
• First phase
  • Presented a wide variety of images to users
  • Which keywords would you use to find the image with a search engine?
• Second phase
  • Rate the adequacy of the annotations
Results (% of users per term; Phase 1 / Phase 2)
• Weihnachtsmann (Father Christmas): 26.5% / 100.0%
• Brille (glasses): 7.8% / 51.8%
• Nikolaus (St. Nicholas): 7.8% / 91.6%
• Weihnachten (Christmas): 6.5% / 61.5%
• Santa Claus: 6.0% / 75.0%
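Percentages of this kind could be aggregated from raw responses as in the following Python sketch. The responses are invented for illustration and are not the data from the experiment; `keyword_user_shares` is a hypothetical helper, and it assumes the figure reports the share of users who named each keyword.

```python
from collections import Counter

def keyword_user_shares(responses):
    """Percentage of users who named each keyword, most frequent first."""
    counts = Counter(k for keywords in responses for k in set(keywords))
    return {k: round(100 * c / len(responses), 1) for k, c in counts.most_common()}

# Invented responses, one list of keywords per user.
responses = [
    ["Weihnachtsmann", "Brille"],
    ["Weihnachtsmann", "Nikolaus"],
    ["Weihnachtsmann"],
    ["Santa Claus", "Weihnachten"],
]
shares = keyword_user_shares(responses)
print(shares["Weihnachtsmann"], shares["Brille"])  # 75.0 25.0
```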
Conclusions from our Experiments
• Document representations become more precise over time.
• A small number of terms describe an image sufficiently.
• A large number of user queries can be satisfied by indexing a small number of terms.
Roof and Windows
• Architecture for emergent semantics
• Users’ individual pragmatics aggregated into representation of documents
• Living document representation
Outlook
• Applying EmSem to distributed IR
• Reducing the size of document representations
• Less network traffic