Using Digg to Discover Events and Save Lives: A Collaborative Approach
This presentation explores how Digg, a social web-media discovery tool, can be leveraged to find interesting events in real-time and identify critical situations that may require immediate attention. By utilizing the collaborative nature of social media and the Digg API, we can analyze vast amounts of user-generated data to extract meaningful insights. We discuss the challenges of preprocessing dirty data, clustering text documents, and optimizing criteria to achieve effective results. Our goal is to enhance data scalability and responsiveness in emergency scenarios.
Presentation Transcript
Using Digg to Save Lives Or, Using Digg to find interesting events. Presented by: Luis Zaman, Amir Khakpour, and John Felix
Explanation • Digg is a social web-media discovery tool based on user-submitted content. • 1 or 2 submissions a minute • Half-life of “interest” is about a day • Digg aggregates “interesting” content. • But how do we find interesting Events and know their Themes?
Motivation • Collaborative nature of Social Media can scour the WWW very thoroughly. • But, this generates A LOT of data (you’ll see). • It would be cool to find emergencies, or critical situations based on this collaborative media. • Apple seems like a pretty good starting point.
Preprocessing • Digg API • REST API • http://services.digg.com/stories/topic/apple?count=10 • XML response • <?xml version="1.0" encoding="utf-8" ?><users timestamp="1176998598" total="1" offset="0" count="1"> <user name="sbwms" icon="http://digg.com/img/user-large/user-default.png" registered="1135702996" profileviews="3104" /></users> • Limitations • 100 results per request • 1 hour of time series data • Can’t go fast, or else.
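A minimal sketch of how the request and response above could be handled in Python. The endpoint and the XML sample come from the slide; the helper names (`build_url`, `parse_items`, `paged_urls`), the `offset` query parameter, and the one-second pause are assumptions for illustration. The sketch parses the sample response instead of calling the network.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE = "http://services.digg.com"  # endpoint root as shown on the slide
MAX_COUNT = 100                    # slide: 100 results per request

def build_url(topic, count=10, offset=0):
    """Build a stories-by-topic URL; paging via `offset` is an assumption."""
    query = urlencode({"count": min(count, MAX_COUNT), "offset": offset})
    return f"{BASE}/stories/topic/{topic}?{query}"

def parse_items(xml_text):
    """Return (envelope attributes, list of attribute dicts, one per child)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return dict(root.attrib), [dict(child.attrib) for child in root]

def paged_urls(topic, total, pause=1.0):
    """Yield one URL per page of results, pausing between pages."""
    for offset in range(0, total, MAX_COUNT):
        yield build_url(topic, MAX_COUNT, offset)
        time.sleep(pause)  # hypothetical delay; the slide gives no actual limit

# Sample response copied from the slide.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8" ?>'
    '<users timestamp="1176998598" total="1" offset="0" count="1"> '
    '<user name="sbwms" '
    'icon="http://digg.com/img/user-large/user-default.png" '
    'registered="1135702996" profileviews="3104" /></users>'
)

envelope, items = parse_items(SAMPLE)
print(envelope["total"], items[0]["name"])
```

Keeping URL construction separate from parsing lets the paging and throttling logic be exercised without any network access.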
Preprocessing • Time Series • Each digg is the event (only 100 at a time) • Rows • Each story’s digg count • Columns • Every hour (2,207 of them from August 08 – November 08) • Clustering • Rows • Each story that was dugg at any point in the time series • Columns • The words in the title and description of this story
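The time-series layout described above (one row per story, one column per hour) can be sketched as follows. The input format, a list of `(story_id, unix_timestamp)` pairs, and the function name `hourly_counts` are assumptions for illustration, not the presenters' code.

```python
from collections import defaultdict

HOUR = 3600  # seconds per hourly bin

def hourly_counts(diggs, start, n_hours):
    """Build {story_id: [digg count per hour]} from (story_id, timestamp) pairs."""
    rows = defaultdict(lambda: [0] * n_hours)
    for story_id, ts in diggs:
        col = (ts - start) // HOUR
        if 0 <= col < n_hours:  # ignore diggs outside the observation window
            rows[story_id][col] += 1
    return dict(rows)

# Hypothetical diggs: two stories observed over a three-hour window.
start = 1_000_000
diggs = [
    ("story-a", start + 10),
    ("story-a", start + 20),
    ("story-a", start + HOUR + 5),
    ("story-b", start + 2 * HOUR + 1),
    ("story-b", start + 10 * HOUR),  # outside the window, dropped
]
matrix = hourly_counts(diggs, start, n_hours=3)
print(matrix)
```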
Preprocessing - Challenges • SLOW • Really Dirty Data • Different Formats of Data • REALLY SLOW
Introduction to Document Clustering • Unlike structured data, text documents are hard to cluster because of: • Volume • Dimensionality • Sparsity • Complex semantics • In information retrieval and text mining, text data is represented in a common representation model, e.g. the Vector Space Model (VSM) • The result is a huge sparse matrix, so we store only the non-zero values • Text documents are converted to a matrix A of size m × n, where m is the number of documents and n is the total number of words (or phrases); each element x_{i,j} is the frequency of the j-th term in the i-th document.
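A small sketch of the Vector Space Model representation described above, storing only non-zero term frequencies. The tokenizer, the dict-of-dicts layout, and the function names are assumptions chosen for brevity; the toy titles are made up.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_sparse_tf(docs):
    """Return (rows, vocab): rows[i] maps column j to x_{i,j}; zeros are omitted."""
    vocab = {}
    rows = []
    for text in docs:
        row = {}
        for term, freq in Counter(tokenize(text)).items():
            j = vocab.setdefault(term, len(vocab))  # assign next free column
            row[j] = freq
        rows.append(row)
    return rows, vocab

def density(rows, n_terms):
    """Fraction of the m x n matrix that is non-zero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * n_terms)

docs = [
    "Apple releases new iPhone",
    "New Apple store opens",
    "apple apple pie",
]
rows, vocab = build_sparse_tf(docs)
print(len(docs), len(vocab), density(rows, len(vocab)))
```

With tens of thousands of stories and terms, almost every entry is zero, which is why only the non-zero cells are kept.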
Clustering • Dataset • Number of stories (m) : 25470 • Total number of unique words (n): 55557 • Nonzero values: 469323 (0.03214%) • Clustering using CLUTO software • Using K-means, bisecting K-means • Calculating centroids and SSE • A C++ program is run on “black”
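To illustrate the terms on this slide, here is a minimal pure-Python sketch of K-means, bisecting K-means, and the sum of squared errors (SSE). It is not CLUTO and not the presenters' C++ program: it assumes small dense vectors and Euclidean distance, and all function names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sse(clusters):
    """Sum of squared distances from every point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)

def kmeans(points, k, rng, iters=50):
    centers = rng.sample(points, k)  # random initial centers
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        updated = [centroid(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
        if updated == centers:  # converged
            break
        centers = updated
    return [c for c in clusters if c]

def bisecting_kmeans(points, k, rng, trials=5):
    """Repeatedly split the cluster with the largest SSE into two."""
    clusters = [list(points)]
    while len(clusters) < k:
        worst = max(clusters, key=lambda c: sse([c]))
        best = min((kmeans(worst, 2, rng) for _ in range(trials)), key=sse)
        if len(best) < 2:  # cluster cannot be split further
            break
        clusters.remove(worst)
        clusters.extend(best)
    return clusters

# Toy data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
flat = kmeans(points, 2, random.Random(0))
bisected = bisecting_kmeans(points, 2, random.Random(0))
print(sse(flat), sse(bisected))
```

Comparing the SSE of the two variants at the same number of clusters is the kind of comparison the experiments slide refers to.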
Document Clustering by Optimizing Criterion Functions • According to Zhao et al., a good document clustering can be found by choosing a criterion function and optimizing it: • Internal Criterion Functions (I) • Maximizing the internal similarity function: • External Criterion Functions (E) • Minimizing the external similarity function: • Hybrid Criterion Functions (H) • Maximizing
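As one concrete reading of the three families above, the sketch below computes the criterion functions commonly labelled I2, E1 and H2 in Zhao and Karypis's work on document clustering. Choosing these particular members of each family is an assumption; the slide does not say which ones were used. Documents are assumed to be unit-length vectors, and D_r denotes the sum of the vectors in cluster r.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def composite(vectors):
    """Component-wise sum of a list of vectors (D_r for a cluster)."""
    return [sum(col) for col in zip(*vectors)]

def i2(clusters):
    """Internal: sum over clusters of ||D_r||; larger is better."""
    return sum(norm(composite(c)) for c in clusters)

def e1(clusters):
    """External: sum of n_r * (D_r . D) / ||D_r||; smaller is better."""
    d_all = composite([v for c in clusters for v in c])
    total = 0.0
    for c in clusters:
        d_r = composite(c)
        total += len(c) * dot(d_r, d_all) / norm(d_r)
    return total

def h2(clusters):
    """Hybrid: I2 divided by E1; larger is better."""
    return i2(clusters) / e1(clusters)

# Toy unit vectors: two documents about one topic, two about another.
x, y = [1.0, 0.0], [0.0, 1.0]
good = [[x, x], [y, y]]  # topics kept together
bad = [[x, y], [x, y]]   # topics mixed
print(i2(good), e1(good), h2(good))
print(i2(bad), e1(bad), h2(bad))
```

On this toy input the topic-pure clustering scores higher on I2 and H2 and lower on E1 than the mixed one, which matches the maximize/minimize directions listed on the slide.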
Experiments • SSE for I (K-Means vs Bisecting K-Means)
Visualization • What we used • jQuery • JavaScript library for DOM manipulation and Ajax requests • PHP/MySQL • Scripting language and database backend • Google Visualization API • Time Series Graph • Zoomable • Timepedia Chronoscope • Clickable
Conclusions • Success? • Of course we think so • Future Work • Save lives? • Better clustering • Cleaner data • More data • Make it scalable, and dynamic • On-line and on the fly?