"Miscellaneous Group": What we learned in Compiegne
Sven Abels, David Parry, Katarzyna Wegrzyn-Wolska, Wai Gen Yee
Contents
• Abels: Splitting compounds
• Parry: Attribution
• Wegrzyn-Wolska: Web page lifetimes
• Yee: P2P information retrieval
jWordSplitter
• Usage of "Bloom filters"?
  • Test whether or not an element is a member of a set
  • Strong space advantage compared to hash tables
• Connect words instead of splitting them?
  • Needed e.g. in China (Google vs. Baidu)
• Reduction of the dictionary to atomic words?
  • To further reduce dictionary size and checking time
• Consideration of further language-specific rules
  • For words that might need a grammatical change after decomposition
• Next: evaluation of the improvement
  • By using the two projects introduced in the presentation
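The Bloom filter idea above can be sketched as follows. This is a minimal illustration in Python, not jWordSplitter's actual code: the class `BloomFilter`, the helper `split_compound`, the filter size, and the three-word German dictionary are all assumptions made for the example. A Bloom filter never reports a stored word as missing, but it can occasionally report an absent word as present, so a real splitter would have to tolerate or re-check such false positives.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the step odd
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


def split_compound(word, known, min_len=3):
    """Return the first split of `word` into two known parts, or None."""
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in known and tail in known:
            return [head, tail]
    return None


# Hypothetical toy dictionary of atomic words.
dictionary = BloomFilter()
for atomic_word in ["haus", "tür", "schlüssel"]:
    dictionary.add(atomic_word)

print(split_compound("haustür", dictionary))
```

With these settings the filter occupies 8 KiB regardless of how long the stored words are, which is the space advantage over a hash table that stores the words themselves.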
Attribution
• K-distance has some similarity to n-grams, but compression algorithms give more flexibility.
• Locating the centroids for clustering can be simplified; this may make clustering via this approach more practical.
• Work in computational linguistics on author identification is related.
• Using the compression dictionary directly may allow comparison between dictionaries rather than the “black box” approach.
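The slides do not define K-distance, so the sketch below uses a stand-in: the normalized compression distance (NCD), a well-known compression-based similarity measure, computed here with `zlib`. The function names `ncd` and `attribute` and the sample texts are assumptions for illustration only. It also treats the compressor as a black box, which is exactly the approach the last bullet suggests moving beyond.

```python
import zlib


def compressed_size(data):
    """Length in bytes of the zlib-compressed data (highest compression level)."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized compression distance between two strings.

    Values near 0 mean the texts share most of their structure;
    values near 1 mean they share very little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(text, samples):
    """Return the key in `samples` whose sample text is closest to `text`."""
    return min(samples, key=lambda author: ncd(text, samples[author]))


# Hypothetical writing samples for two authors.
samples = {
    "A": "the quick brown fox jumps over the lazy dog. " * 20,
    "B": "pack my box with five dozen liquor jugs. " * 20,
}
query = "the quick brown fox jumps over the lazy dog. " * 5
print(attribute(query, samples))
```

Real attribution would need much longer and more varied samples per author; the repeated sentences here only make the effect easy to see.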
Lifetimes of Web Pages
• The lifespan, accessibility and archiving of dynamic documents
• Why is this problematic?
• Interesting comments and questions:
  • Measuring the lifespan of dynamic documents, and why it matters to search engines
  • Defining the lifespan: at what point should a changed page be considered a new one?
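One possible way to make the "new page" question concrete is a similarity threshold: a page's lifespan ends at the first snapshot that no longer resembles the original enough. The sketch below is only an illustration of that idea, not the definition used in the presentation. The function `lifespan`, the threshold of 0.5, the integer timestamps, and the snapshot texts are all assumptions.

```python
from difflib import SequenceMatcher


def lifespan(snapshots, threshold=0.5):
    """Time from the first snapshot until the page counts as a new one.

    `snapshots` is a list of (timestamp, text) pairs sorted by timestamp.
    The page is treated as new once its similarity to the first snapshot
    drops below `threshold`. Returns None if that never happens.
    """
    first_time, first_text = snapshots[0]
    for timestamp, text in snapshots[1:]:
        similarity = SequenceMatcher(None, first_text, text).ratio()
        if similarity < threshold:
            return timestamp - first_time
    return None


# Hypothetical crawl history; timestamps are in arbitrary units (e.g. days).
history = [
    (0, "news: alpha beta gamma"),
    (10, "news: alpha beta gamma!"),        # minor edit, still the same page
    (20, "012345678901234567890123456789"),  # entirely different content
]
print(lifespan(history))
```

Comparing against the first snapshot rather than the previous one means that many small edits can add up to a "new" page; comparing consecutive snapshots would be an equally defensible choice with different behaviour.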
Peer-to-Peer IR
• Reputations of peers
• Identify spoofers and spam
• Expand the model:
  • Development of P2P Googles