1 / 50

Web 2.0 Data Analysis

Web 2.0 Data Analysis . Daniel Deutch. Data Management. “Data management is the development, execution and supervision of plans, policies, programs and practices that control , protect , deliver and enhance the value of data and information assets”

tawny
Télécharger la présentation

Web 2.0 Data Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Web 2.0 Data Analysis Daniel Deutch

  2. Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets” (DAMA Data Management Body of Knowledge) A major success: the relational model of databases

  3. Relational Databases • Developed by Codd (1970), who won the Turing award for the model • Huge success and impact: • The vast majority of organizational datatoday is stored in relational databases • Implementations include MS SQL Server, MS excel, Oracle DB, mySQL,… • 2 Turing award winners (Edgar F. Codd and Jim Gray) • Basic idea: data is organized in tables (=relations) • Relations can be derived from other relations using a set of operations calledthe relational algebra • On which SQL is largely based

  4. Research in Data(base) Management • 1970- : Relational Databases (tables). • Indexing, Tuning, Query Languages, Optimizations, Expressive Power,…. • ~20 years ago: Emergence of the Web and research on Web data • XML, text database, web graph…. • Google is a product of this research (by Stanford’s PhD students Brin and Page) • Recent years: hot topics include distributed databases, data privacy, data integration, social networks, web applications, crowdsourcing,trust,… • Foundations taken from “classical” database research • Theoretical foundations with practical impact

  5. Web 2.0 • “Old” web (“Web 1.0”): static pages • News, encyclopedic knowledge... • No, or very little, interactive process between the web-page and the user. • Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user. • “Network as platform" computing • The “participatory Web”

  6. Web 2.0 • “Old” web (“Web 1.0”): static pages • News, encyclopedic knowledge... • No, or very little, interactive process between the web-page and the user. • Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user. • “Network as platform" computing • The “participatory Web”

  7. Online shopping

  8. Advertisements

  9. Social Networks

  10. Collaborative Applications

  11. Data is all around • Web graph • “Social graph” • Pictures, Videos, notifications, messages.. • Data that the application processes • Advertisments • Even the application structure itself

  12. (A small portion of) the web graph

  13. News Feed

  14. Need to Analyze • Huge amount of data out there • Est. 13.68 billion web-pages and counting • Half a billion tweets per day and counting • An average user “sees” about 600 tweets per day • Most of it is irrelevant for you, some is incorrect

  15. Maybe we shouldn’t

  16. Haaretz, 7/3/13

  17. Static vs. Dynamic data • For “static” web data: problem addressed by search engines • Allow users to specify criteria for search (e.g. keywords), and rank the results. • For “dynamic” web 2.0 data (tweets, notifications, web applications): • No satisfying solution yet. • Data volume is constantly increasing.

  18. Filter, Rank, Explain • Filter • Select the portion of data that is relevant • Group similar results • Rank • Rank data by trustworthiness, relevance, recency... • Present highest-rank first • Explain • An explanation of whyis the data considered relevant/highly-ranked • An explanation of howhas the data propagated • “Why do I see this?”

  19. Approach • Leverage knowledge from “classic” database research • Account for the new challenges • Do so in a generic manner • Leverage unique features such as collaborative contribution, distribution, etc.

  20. sname sid=sid cid=cid name=“Mary” Courses Takes Students Data model Query language Students Select… From… Where… Physical Storage Indexing Distribution ...

  21. Foundations • Model • Query Language • Query evaluation algorithms • Prototype implementation and optimizations • Getting Data and Testing

  22. Two concrete examples • Analysis of Web Applications • Analysis of Social Networks

  23. Analysis of Web Applications

  24. Analysis Questions • For the application owner: • Can the user make a reservation without giving credit card details? • What is the probability that a user would choose a category for which there are no available products? • For the application user: • How can I get the best deals (by navigating)? • How can I navigate in an optimal way (minimal number of clicks)

  25. ShopIT-Shopping Assistant

  26. ShopIT-Shopping Assistant

  27. ShopIT-Shopping Assistant

  28. Modeling • Model should be declarative, intuitive • Abstract representation • Allows readability, portability, applicability, optimizations… • Tradeoffbetween expressivity and complexity

  29. Application Model

  30. Effects guiding the execution • Dynamic effects such as the choice of which link to follow • Probability/cost value can be associated with every such choice • Cumulative: if the same choice is made twice, double the cost • Static effects such as available products in the database • Static in the sense that it is constant in a single execution • Thus non-cumulative

  31. Filter, Rank, Explain • Filter • Choose “interesting” navigation sequences • Rank • Rank different sequences by length, purchase cost,… • Explain • Why are these the best sequences? What do they entail in terms of purchase? How can they be refined?

  32. Query Language • Desired executions/navigation sequences are typically expressed in temporal logic • Database analysis is typically done through a First-order-based query language (e.g. SQL) • A query language for data-dependent application should combine both. • While allowing for efficient evaluation. • LTL-FO

  33. Query Evaluation • Problem: Given a process specification and a query, evaluatethe query • To find all executions that satisfy the criteria • To find the probability that the criteria are satisfied • To find the minimal cost execution realizing the criteria • …

  34. Interesting problems are hard (sometimes) • The (corresponding boolean) problems are NP-hard • Very unlikely that they can be solved efficiently for worst case input • This is often the case when modeling real-life problems • To be considered, but not to be feared!

  35. What do you do with a theoretically hard problem? • Find restricted, yet realistic, cases that are “easy”, i.e. admit efficient worst case algorithms • This also allows to have a better understanding of “where is the hardness” • Example: If there are only “few” static choices, the problem becomes easy • Design optimization techniques for practically efficient implementations

  36. Optimization • One idea is based on the notion of data provenance • Originally for “classic” database queries, provenance is a “concise” representation of the query computation • Here we generalize the approach for the temporal queries • Allows for a generic approach and many optimizations • A system prototype for provenance-aware analysis of processes is under development

  37. How do we get data? • Some applications (e.g. Yahoo! Shopping) allows a (limited)interface to their database. • Application structure can be obtained by crawling. • Allows to prove that the model is realistic, and to test feasibility of algorithms on real-life data scale

  38. Open problems • Uncertainty and partial knowledge • Data updates by the application • Incremental maintenance • How do modifications to the database can be accounted for without restarting the analysis?

  39. Analysis of Social Networks • Social network common model of disseminating data is in a “push” mode • Facebook’s news feed, Twitter’s tweets… • Huge load of information, and it continues to grow • No standard way of searching and ranking such posts

  40. A Machine Learning Approach • “Learn” users preferences • Ask them whether they like a particular post/tweet, or use existing mechanisms • Try to generalize by extracting characteristic features (e.g. topics) • Use the generalized function to predict future preferences

  41. A data management approach • Allow users to tell the system their preferences, through a “query” language. • One family of such languages are rule-based languages • Prolog, Datalog,... • Simpler than standard programming languages • Allows for easier analysis

  42. Rule based language(Logic Programming) descendent(X,Y):- parent(X,Y) descendent(X,Y):- parent(X,Z),descendent(Z,Y)

  43. For tweeter See(tweet):-Tweets(Tova,tweet) See(tweet):-Tweets(x,tweet), related(x,"basketball") Attending(party):- count[friend(x),like(x,party)]>10

  44. Learning • Unlike standard database querying, some of the predicates may be difficult to evaluate • How do we know that a post is related to basketball? • What can we do?

  45. What can we do • Text analysis • Problem: An ”I love basketball” tweet is not really a good match • Use tagging if exists • Ask other peers in the network! • Can also be used to propose predicates to users

  46. Influence • Predict effects and allow users to influence friends • "How do I get at least 5 of my friends to come with the party with me?" • Need them to (1) see the party, (2) click "attending" • "How do I get at least 1000 people to go to a political rally?" • Can be answered by an analysis of the rules!

  47. Interface for Specifying Rules • Cannot expect Facebook/Twitter users to be able to program • Need to design an interface • The interface components can be dynamically designed according to answers by other peers • An ongoing research

  48. Conclusions • Web 2.0 data poses new challenges in research • Data is dynamic, access mode is unique, analysis goals different than in classic databases • Lots of research on theoretical questions with strong practical applications

  49. Questions?

More Related