
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets, SIGMOD ’11, C. Chen et al.



  1. TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets • SIGMOD ’11 • C. Chen et al. • Pete Bohman, Adam Kunk

  2. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  3. Real-Time Search • Requirements • Contents searchable immediately following creation • Scale to thousands of updates/sec (e.g., news of Osama bin Laden's death: 5,000 tweets/sec) • Results relevant to the query via cost-efficient ranking • Tradeoff: • Scalability and performance vs. ranking quality

  4. Real-Time Search • Applications • The ability to receive updates as they occur • Applicability • It may not be feasible to provide real-time search results in a system with thousands of new entries per second

  5. TI: Tweet Index • TI is an indexing and ranking mechanism for real-time search in microblogging systems, such as Twitter. • To return real-time results, TI indexes only some tweets immediately (distinguished tweets); the rest (noisy tweets, deemed less important) are indexed periodically.

  6. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  7. Partial Indexing • The Case for Partial Indexes • Stonebraker, 1989 • Index only a portion of a column • User-specified index predicates (where salary > 500) • Build index as a side effect of query processing
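
A minimal sketch of the partial-index idea, assuming rows are held in memory as dictionaries. The class name `PartialIndex` and the sample rows are hypothetical and only illustrate the slide's "where salary > 500" predicate; this is not the paper's implementation.

```python
from collections import defaultdict


class PartialIndex:
    """Index one column, but only for rows that satisfy a predicate (hypothetical helper)."""

    def __init__(self, column, predicate):
        self.column = column
        self.predicate = predicate
        self.entries = defaultdict(list)  # column value -> row positions

    def build(self, rows):
        for pos, row in enumerate(rows):
            if self.predicate(row):  # rows failing the predicate are never indexed
                self.entries[row[self.column]].append(pos)

    def lookup(self, value):
        return self.entries.get(value, [])


# Made-up rows for illustration only.
rows = [
    {"name": "ann", "salary": 400},
    {"name": "bob", "salary": 900},
    {"name": "cat", "salary": 900},
]
index = PartialIndex("salary", lambda row: row["salary"] > 500)
index.build(rows)
```

Queries whose predicate falls outside the indexed portion (here, salaries of 500 or less) would have to fall back to a scan, which is the trade-off a partial index accepts.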

  8. View Materialization • An application of materialized views is to use cost models to automatically select which views to materialize. • Materialized views can be thought of as snapshots of a database, in which the results of a query are stored in an object. • The concept of only indexing essential tweets in real-time was borrowed from the idea of view materialization.

  9. Microblog Search • Google and Twitter have both released real-time search engines. • Google’s engine adaptively crawls the microblog • Twitter’s engine relies on Apache Lucene (a high-performance, full-featured text search engine library) • However, both the Google and Twitter engines use only time in their ranking algorithms. • TI’s ranking algorithm also considers the user's PageRank, topic popularity, and similarity to the query.

  10. TI Cost Reduction • TI clusters similar tweets together and offloads noisy tweets in order to reduce the computation cost of real-time search. • Related tweets are grouped into topics using a tree structure. • Tweets replying to the same tweet or belonging to the same thread are organized as a tree. • TI also maintains popular topics in memory.

  11. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  12. TI Architecture

  13. User Graph • Twitter users have links to friends • A user graph represents this relationship • G_u = (U, E) • U is the set of users in the system • E is the set of friend links between them

  14. Tweet Tree Structure • Nodes represent tweets • Directed edges indicate replies or retweets • Implemented by assigning tweets a tree encoding ID
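
The slide does not say which tree encoding is used. As one possible illustration, the sketch below assumes a Dewey-style path code (root "1", its first reply "1.1", and so on); the class `TweetTree` and its methods are hypothetical names, not taken from the paper.

```python
class TweetTree:
    """Tweet tree with Dewey-style path codes (an assumed encoding, for illustration)."""

    def __init__(self, root_tid):
        self.root_tid = root_tid
        self.code = {root_tid: "1"}    # TID -> encoding ID
        self.children = {root_tid: 0}  # TID -> number of direct replies so far

    def add_reply(self, tid, parent_tid):
        # A reply's code is its parent's code plus its position among the siblings.
        self.children[parent_tid] += 1
        self.code[tid] = f"{self.code[parent_tid]}.{self.children[parent_tid]}"
        self.children[tid] = 0
        return self.code[tid]

    def is_descendant(self, tid, ancestor_tid):
        # With path codes, ancestry is a prefix test on the codes alone.
        return self.code[tid].startswith(self.code[ancestor_tid] + ".")


tree = TweetTree(100)
code_a = tree.add_reply(101, 100)
code_b = tree.add_reply(102, 100)
code_c = tree.add_reply(103, 101)
```

Under this assumption, a tweet's position in the thread can be recovered from its ID without walking the tree.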

  15. TI Design • Search is handled via an inverted index for tweets • Given a keyword, the inverted index returns a tweet list, T • T contains the set of matching tweets, sorted by timestamp

  16. TI Inverted Index • TID = tweet ID • U-PageRank = the user's PageRank, used for ranking • TF = term frequency • tree = TID of root node of tweet tree • time = timestamp
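
The posting fields listed above can be sketched as a small in-memory inverted index. This is an illustration only: the names `Posting` and `InvertedIndex`, the whitespace tokenizer, and the newest-first sort order are assumptions (the slides say only that the list is sorted by timestamp).

```python
from collections import defaultdict, namedtuple

# One posting per (keyword, tweet), carrying the fields named on the slide.
Posting = namedtuple("Posting", ["tid", "u_pagerank", "tf", "tree", "time"])


class InvertedIndex:
    def __init__(self):
        self.lists = defaultdict(list)  # keyword -> list of Posting

    def add(self, tid, text, u_pagerank, tree, time):
        words = text.lower().split()  # naive tokenizer, assumed for the sketch
        for word in set(words):
            tf = words.count(word)    # term frequency of the word in this tweet
            self.lists[word].append(Posting(tid, u_pagerank, tf, tree, time))

    def search(self, keyword):
        postings = self.lists.get(keyword.lower(), [])
        # Newest first is an assumption about the sort direction.
        return sorted(postings, key=lambda p: p.time, reverse=True)


idx = InvertedIndex()
idx.add(tid=1, text="storm storm warning", u_pagerank=0.4, tree=1, time=10)
idx.add(tid=2, text="storm update", u_pagerank=0.2, tree=1, time=20)
hits = idx.search("storm")
```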

  17. Ranking Support • In order to help ranking, TI keeps a table of metadata for each tweet • TID = tweet ID • RID = ID of replied tweet (to find parent) • tree = TID of root node of tweet tree • time = timestamp • count = number of tweets replying to this tweet

  18. In-memory structures • Certain structures are kept in-memory to support indexing and ranking • Keyword threshold – records statistics of recent popular queries • Candidate topic list – information about recent topics • Popular topic list – information about highly discussed topics

  19. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  20. TI Indexing Overview • TI categorizes tweets as either distinguished or noisy • Distinguished: real-time indexing scheme • Noisy: background batch indexing scheme • When a new tweet arrives, its content is analyzed in order to categorize it as one of these two types.

  21. TI Inverted Index

  22. Real-Time Indexing • New tweets categorized as being distinguished (index these immediately) • If tweet belongs to existing tweet tree, retrieve its parent tweet to get root ID and generate encoding. Update count number in parent. • Tweet is inserted into tweet data table. • Tweet is inserted into inverted index. • Main cost is updating the inverted index (due to each keyword in the tweet).
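
The real-time indexing steps can be sketched as follows. This is a simplified illustration under stated assumptions: plain dictionaries stand in for the tweet data table and the inverted index, the function name `index_now` is hypothetical, and the tree encoding step is omitted.

```python
from collections import defaultdict

tweet_table = {}                    # TID -> metadata row (TID, RID, tree, time, count)
inverted_index = defaultdict(list)  # keyword -> list of TIDs


def index_now(tid, text, time, rid=None):
    """Index a distinguished tweet immediately (hypothetical helper)."""
    if rid is not None and rid in tweet_table:
        parent = tweet_table[rid]
        tree = parent["tree"]   # inherit the root ID from the parent tweet
        parent["count"] += 1    # one more reply to the parent
    else:
        tree = tid              # the tweet starts a new tree
    tweet_table[tid] = {"TID": tid, "RID": rid, "tree": tree, "time": time, "count": 0}
    # One index update per distinct keyword: this loop is the main cost.
    for word in set(text.lower().split()):
        inverted_index[word].append(tid)


index_now(1, "quake in city", time=100)
index_now(2, "quake felt here", time=101, rid=1)
index_now(3, "me too", time=102, rid=2)
```

Because every reply copies the root ID from its parent, finding the tree a tweet belongs to takes a single lookup regardless of thread depth.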

  23. Batch Indexing • New tweets categorized as being noisy (index these at a later time) • Instead of indexing in inverted index, append tweet to log file. • Batch indexing process periodically scans the log file and indexes the tweets there.
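
A minimal sketch of the split between immediate and deferred indexing. The in-memory list `log` stands in for the log file, and the classifier passed as `is_distinguished` is a made-up placeholder: the slides do not give the rule TI uses to categorize tweets.

```python
from collections import defaultdict

inverted_index = defaultdict(list)  # keyword -> list of TIDs
log = []                            # stands in for the append-only log file


def index_tweet(tid, text):
    for word in set(text.lower().split()):
        inverted_index[word].append(tid)


def receive(tid, text, is_distinguished):
    """Index distinguished tweets now; append noisy tweets to the log."""
    if is_distinguished(text):
        index_tweet(tid, text)
    else:
        log.append((tid, text))


def run_batch():
    """Periodic job: index everything in the log, then empty it."""
    pending = len(log)
    for tid, text in log:
        index_tweet(tid, text)
    log.clear()
    return pending


# Placeholder rule, purely for illustration.
rule = lambda text: "breaking" in text.lower()
receive(1, "Breaking news downtown", rule)
receive(2, "lunch was ok", rule)
searchable_before = "lunch" in inverted_index
indexed_in_batch = run_batch()
```

Until `run_batch` runs, the noisy tweet cannot be found by search; that delay is the price paid for cheaper updates on the real-time path.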

  24. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  25. Ranking Desiderata • “The ranking function must consider both the timestamp of the data and the similarity between the data and the query.” • “The ranking function is composed of two independent factors, time and similarity.” • “The ranking function should be cost-efficient.”

  26. Ranking Overview • Ranking functions are completely separate from the indexing mechanism • New ranking functions could be used • TI’s proposed ranking function is based on: • User’s PageRank • Popularity of the topic • Timestamp (self-explanatory) • Similarity between tweet and the query

  27. User’s PageRank • Twitter has two types of links between users • f(u): the set of users who follow user u • f^-1(u): the set of users whom user u follows • A matrix, Mf[i][j], is used to record the following links between users • A weight factor is given for each user • V = (w1, w2, ..., wn)

  28. User’s PageRank Formula • PageRank formula is given as: P_u = V × M_f^x • So, the user’s PageRank is a combination of their user weight and how many followers they have • The more popular the user, the higher the PageRank
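
A rough sketch of the computation, reading the slide's formula as P_u = V × M_f^x with x taken as the number of multiplications. The row normalization of the matrix (each user's weight is split evenly among the users they follow) is an assumption, no damping factor is included because the slides mention none, and `user_pagerank` is a hypothetical name.

```python
def user_pagerank(weights, follows, iterations=1):
    """Multiply the weight vector V by the follow matrix M_f `iterations` times.

    weights: list of per-user weights (V).
    follows: dict mapping user index i to the list of users that i follows.
    """
    n = len(weights)
    # Assumed normalization: matrix[i][j] = 1 / (number of users i follows).
    matrix = [[0.0] * n for _ in range(n)]
    for i, followed in follows.items():
        for j in followed:
            matrix[i][j] = 1.0 / len(followed)
    ranks = list(weights)
    for _ in range(iterations):
        # Row vector times matrix: weight flows from follower i to followee j.
        ranks = [sum(ranks[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return ranks


# Users 0 and 1 follow user 2; user 2 follows user 0.
ranks = user_pagerank([1 / 3, 1 / 3, 1 / 3], {0: [2], 1: [2], 2: [0]}, iterations=1)
```

In this toy graph, user 2 has the most followers and ends up with the highest score, while user 1, whom nobody follows, ends up with none.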

  29. Popularity of Topics • Users can retweet or reply to tweets. • Popularity can be determined by looking at the largest tweet trees. • The popularity of a tree is the sum of the U-PageRank values of all tweets in the tree
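
The popularity rule is a per-tree sum, sketched here under the assumption that each tweet record carries its root ID (`tree`) and its author's U-PageRank. The function name and the sample values are made up for illustration.

```python
from collections import defaultdict


def tree_popularity(tweets):
    """Sum the U-PageRank of every tweet, grouped by the tree it belongs to."""
    popularity = defaultdict(float)
    for tweet in tweets:
        popularity[tweet["tree"]] += tweet["u_pagerank"]
    return dict(popularity)


# Illustrative values only.
tweets = [
    {"tid": 1, "tree": 1, "u_pagerank": 0.5},
    {"tid": 2, "tree": 1, "u_pagerank": 0.25},
    {"tid": 7, "tree": 7, "u_pagerank": 0.125},
]
scores = tree_popularity(tweets)
```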

  30. Similarity between query and tweet • The similarity of a query q and a tweet t is the cosine similarity of their vectors: sim(q, t) = (q · t) / (|q| |t|)
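
A small sketch of this similarity, assuming the query and the tweet are each represented as raw term-frequency vectors over whitespace-separated words. The slides do not specify the term weighting, so that representation is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(query, tweet):
    """sim(q, t) = (q . t) / (|q| |t|) over term-frequency vectors."""
    q = Counter(query.lower().split())
    t = Counter(tweet.lower().split())
    dot = sum(q[word] * t[word] for word in q)  # missing words count as 0
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_t = math.sqrt(sum(c * c for c in t.values()))
    if norm_q == 0 or norm_t == 0:
        return 0.0  # an empty query or tweet matches nothing
    return dot / (norm_q * norm_t)


same = cosine_similarity("real time search", "real time search")
partial = cosine_similarity("real search", "real index")
disjoint = cosine_similarity("real search", "tweet index")
```

Identical texts score about 1, texts with no shared words score 0, and the score does not depend on how long the tweet is, only on the direction of its term vector.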

  31. Ranking Function • q.timestamp = query submission time • tree.timestamp = timestamp of the tree that tweet t belongs to (timestamp of its root node) • w1, w2, w3 are weight factors for each component (all set to 1)

  32. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  33. Evaluation

  34. Outline • Introduction • Related Work • System Overview • Indexing Scheme • Ranking • Evaluation • Conclusion

  35. Conclusion
