
Twitterpedia



Presentation Transcript


  1. Twitterpedia Visualization Lab By: Thomas Kraft

  2. Problem
     [Sections: Overview, Current State, Future]
     • What is being talked about, and where?
     • Twitter has massive amounts of data
     • Tweets are unstructured
     • Goal: Quickly identify current events / topics on a large scale

  3. What Needs To Be Done
     • Data collection
     • Database
     • Web crawler
     • Analyze data
     • Topic modeling
     • Get trends and topics!

  4. Hadoop
     • Processes large datasets
     • Splits data into chunks
     • Data is processed on multiple machines
     • Very scalable
     • Add/remove computers easily
     • As the dataset grows, so can the number of machines
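The split, process, and combine pattern on the Hadoop slide can be illustrated with a minimal single-process sketch. It does not use Hadoop or its API; the function names (`split_into_chunks`, `map_chunk`, `reduce_counts`) and the sample tweets are assumptions made up for illustration, and the "machines" are simulated by a plain loop.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Map step: count words within one chunk (one 'machine' in this sketch)."""
    counts = Counter()
    for tweet in chunk:
        counts.update(tweet.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial word counts into one."""
    return left + right


# Made-up sample data, not from the presentation.
tweets = [
    "hadoop splits data",
    "data on many machines",
    "more data more machines",
]

chunks = split_into_chunks(tweets, chunk_size=2)
partials = [map_chunk(c) for c in chunks]  # on a cluster, one chunk per machine
totals = reduce(reduce_counts, partials, Counter())
print(totals.most_common(3))
```

Because each chunk is counted independently, adding machines only changes how many chunks are processed at once; the reduce step stays the same.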

  5. Computer Cluster

  6. Topic Modeling
     • Latent Dirichlet Allocation (LDA)
     • Correlations between words in topics
     • Topics are composed of keyword groups
     • A tweet's topic can effectively be inferred
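To make the LDA slide concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. The presentation does not say which LDA implementation or inference method was used, so the sampler, the hyperparameters (`alpha`, `beta`), the helper names, and the toy documents are all assumptions for illustration only.

```python
import random


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word, vocab) where the first two are count tables.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Give every token a random initial topic and record it in the counts.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab


def top_words(topic_word, vocab, n=3):
    """Return the n highest-count words for each topic (the keyword groups)."""
    return [
        [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:n]]
        for row in topic_word
    ]


def infer_topic(doc_topic_row):
    """Pick the topic with the most tokens assigned in one document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])


# Made-up toy documents standing in for tokenized tweets.
docs = [t.split() for t in [
    "award show music performance",
    "music award singer performance",
    "game score team win",
    "team game win score",
]]
doc_topic, topic_word, vocab = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word, vocab))
print([infer_topic(row) for row in doc_topic])
```

The inner loop revisits every token on every iteration, which is why the cost grows with both dataset size and iteration count.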

  7. June 26, 2011
     • “Bruno & alicia! I love it!”
     • “Can Rick Ross Please put his clothes on?”

  8. Challenge
     • Topic modeling is resource intensive
     • It iterates over the data
     • A single computer can’t handle a large dataset
     • Solution: Parallelize the process

  9. Parallel LDA
     • Write an algorithm to split up tweets and join the output
     • Improves scalability for LDA
     • Shows near-linear improvements
     • PLDA will take Twitterpedia to the next level
     • Larger datasets with quicker processing
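The "split up tweets and join output" step can be sketched as follows. This is not the PLDA implementation and not the author's algorithm, which the slides do not show. The per-shard work is a hypothetical placeholder that only tallies (topic, word) pairs, the topic assignments are made up, and threads stand in for separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_tweets(tweets, num_shards):
    """Distribute tweets round-robin so shard sizes differ by at most one."""
    return [tweets[i::num_shards] for i in range(num_shards)]


def process_shard(shard):
    """Hypothetical per-worker step: tally (topic, word) pairs in one shard.

    In a real parallel LDA run this would be a sampling pass over the shard.
    """
    counts = Counter()
    for topic, words in shard:
        for word in words:
            counts[(topic, word)] += 1
    return counts


def join_outputs(partials):
    """Aggregate the per-shard tallies into one global topic-word table."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Each tweet is (assigned_topic, tokens); the assignments are made up.
tweets = [
    (0, ["award", "music"]),
    (1, ["game", "score"]),
    (0, ["music", "singer"]),
    (1, ["team", "game"]),
    (0, ["award", "show"]),
]

shards = split_tweets(tweets, num_shards=2)
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_shard, shards))
global_counts = join_outputs(partials)
print(global_counts[(0, "music")], global_counts[(1, "game")])
```

Since the join is a plain sum of counts, the merged table matches what a single pass over all tweets would produce, regardless of how the tweets were sharded.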

  10. Future
     • Write an algorithm to parallelize tweet distribution and aggregation
     • Create a website implementing topics

  11. Conclusion
     • Working on this project has been a great learning experience
     • Designed and managed a large database
     • Efficiency was a high priority
     • Learned cool tricks along the way…

  12. Thanks
     • A special thanks to my advisor Xiaoyu Wang, Wenwen Dou, and to the Visualization Center
     • Thomas Kraft: tbkraf08@stlawu.edu

  13. Questions?
