CS910: Foundations of Data Analytics

Graham Cormode
G.Cormode@warwick.ac.uk
Case Studies

Case Studies
• Four papers on data analytics published in the scientific literature:
• “I Tube, You Tube, Everybody Tubes: Analyzing the World’s Largest User Generated Content Video System”, Internet Measurement Conference, 2007
• “What is Twitter, a Social Network or a News Media?”, 19th International Conference on World Wide Web, 2010
• “Meme-tracking and the Dynamics of the News Cycle”, Knowledge Discovery and Data Mining (KDD), 2009
• “Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge”, 2011 International Joint Conference on Neural Networks (IJCNN)

Details on Case Studies
• Full details and links to the papers are on the course web site: www2.warwick.ac.uk/fac/sci/dcs/teaching/material/cs910/
• Please read the papers in detail to get the full story
• Bias: the papers are from the Computer Science research community
• Mostly address data analysis applied to large websites
• The most well-studied example of “Big Data”
• Examples should be familiar to you (YouTube, Facebook, Twitter)
• Objectives for the case studies:
• To see examples of data analytics in practice
• To introduce and motivate topics we will study in more detail later
• To see examples of going from data to insight to understanding

Case Study 1: Online video
“I Tube, You Tube, Everybody Tubes: Analyzing the world’s largest user generated content video system”
By Meeyoung Cha, Haewoon Kwak, Pablo Rodriguez, Yong-Yeol Ahn, Sue Moon (Telefonica Research and KAIST)
Published in Internet Measurement Conference 2007
http://conferences.sigcomm.org/imc/2007/papers/imc131.pdf

Objectives
• To understand the impact of video sharing systems
• To study the popularity life-cycle of videos
• Study statistical properties of requests, and their relation to video age
• Study the prevalence of copying activities
• Understand the potential for caching to save bandwidth

Data Collection
• Crawled YouTube and Daum sites in 2007
• Wrote programs to automatically collect data about all videos
• YouTube was already very large in 2007
• Restricted crawl to ‘Entertainment’ and ‘Science/Tech’ categories
• Collected data on each video:
• Fixed: uploader id, date of upload, duration of video
• Variable: #views, #total ratings, #positive ratings, links to
• Daily crawl for 6 days to see changes

Video popularity distribution
• Plot what fraction of views are outside the top videos
• Normalize ranks from 0 to 100, to allow comparison
• Top 10% of videos account for 80% of views
• Very skewed distribution
• Wide variation in popularity. Why?
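
The "top 10% account for 80% of views" statistic can be computed directly from a list of per-video view counts. The following is a minimal sketch; the function name `top_share` and the view counts are made up for illustration and are not taken from the paper.

```python
def top_share(views, top_fraction):
    """Fraction of all views received by the most-viewed `top_fraction` of videos."""
    ranked = sorted(views, reverse=True)
    # Number of videos in the "top" group (at least one).
    k = max(1, round(top_fraction * len(ranked)))
    return sum(ranked[:k]) / sum(ranked)


# Made-up view counts for ten videos (illustrative only).
views = [800, 50, 40, 30, 25, 20, 15, 10, 6, 4]
share = top_share(views, 0.10)
print(f"Top 10% of videos receive {share:.0%} of views")
```

Sweeping `top_fraction` from 0 to 1 and plotting `1 - top_share(...)` gives a curve of the kind described on the slide.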

Understanding video popularity distribution
• “Skewness” (“the long tail”) is a common phenomenon in data
• Observed by plotting data on a log-log scale: a power law shows up as a straight line
• Plot views on the x-axis, and #videos with more than x views on the y-axis
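
The quantity plotted on the y-axis is a complementary cumulative distribution (CCDF). A minimal sketch of how it could be computed, using fractions rather than raw counts; the function name `ccdf` and the sample values are illustrative assumptions.

```python
from bisect import bisect_right


def ccdf(values):
    """Return (x, fraction of values strictly greater than x) for each distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_right gives the number of values <= x in the sorted list.
        greater = n - bisect_right(ordered, x)
        points.append((x, greater / n))
    return points


# Made-up view counts; plot these points with both axes on a log scale.
for x, frac in ccdf([1, 1, 2, 4, 4, 8]):
    print(x, frac)
```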

Modeling Skewness
[Figure: two log-log plots, x-axis from 10^0 to 10^6, y-axis from 10^-6 to 10^0]
• Several distributions generate skew
• “Power law” (Pareto, Zipf): $y \propto x^{-a}$ for some $a$
• Gives a straight line on a log-log plot
• “Power law with exponential cut-off”: $y \propto x^{-a} e^{-bx}$
• $x < 1/b$: behaves like a power law
• $x > 1/b$: behaves as exponential decay
• Log-normal: the logarithm of the quantity follows a Normal distribution
• How to tell which we are seeing?

Fit a curve
• Find the best fitting curve for each model
• A regression problem (covered later)
• Log-normal captures behaviour for popular videos (head)
• Power law with exponential cutoff seems best for tail
• Why?
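
As a rough illustration of curve fitting as regression, a pure power law can be fitted by ordinary least squares on the log-log points. This is a sketch under that assumption only; it is not the fitting procedure of the paper, and a straight-line fit on log-log data is known to be a crude estimator for real measurements. The function name and sample data are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log c - a * log x; returns (a, c)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic data generated exactly from y = 5 * x^(-1.5).
xs = [1, 2, 4, 8, 16, 32]
ys = [5.0 * x ** -1.5 for x in xs]
a, c = fit_power_law(xs, ys)
print(a, c)
```

Fitting the cut-off or log-normal models requires a different objective; comparing the residuals of each fit is one way to decide which model describes the head and which the tail.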

Possible explanations
Several mechanisms are known to generate long tail distributions:
• Preferential attachment: popular items are the most likely to attract further views
• Describes the main behaviour, but not the truncated tail
• Aging effect: “old” items eventually die, and receive no new activity
• Does not fit videos: no ‘death’ or ‘removal’ of an old video
• Information filtering: a user can only view a fixed number of items
• Does not fit: people can keep watching new videos
• Fetch-at-most-once: a user can view each video at most once
• Better: people prefer to watch new videos, and don’t watch the top 10 over and over
• Some exceptions: music fans watch favourites many times

Validating ‘fetch-at-most-once’
• Simulate preferential attachment + fetch-at-most-once
• R requests per user
• U different users
• V different videos
• Observations from simulation
• Increasing R sharpens tail
• Increasing U shifts graph
• Shape doesn’t change much
• Do you agree?
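
A simulation in this spirit can be sketched as follows. This is an assumption-laden toy, not the paper's simulator: video popularity is modelled by fixed Zipf-like weights (standing in for preferential popularity), and fetch-at-most-once is enforced by redrawing any video the user has already seen. All names and parameter values are illustrative.

```python
import random
from itertools import accumulate


def simulate(num_users, num_videos, requests_per_user, at_most_once, seed=0):
    """Return per-video view counts for U users making R requests over V videos."""
    assert requests_per_user <= num_videos
    rng = random.Random(seed)
    # Assumed Zipf-like popularity: weight 1/rank for the video of that rank.
    cum_weights = list(accumulate(1.0 / rank for rank in range(1, num_videos + 1)))
    videos = range(num_videos)
    views = [0] * num_videos
    for _ in range(num_users):
        seen = set()
        fetched = 0
        while fetched < requests_per_user:
            v = rng.choices(videos, cum_weights=cum_weights)[0]
            if at_most_once and v in seen:
                continue  # already watched by this user: draw again
            seen.add(v)
            views[v] += 1
            fetched += 1
    return views


capped = simulate(num_users=200, num_videos=500, requests_per_user=50, at_most_once=True)
free = simulate(num_users=200, num_videos=500, requests_per_user=50, at_most_once=False)
print("most-viewed video:", max(capped), "views (at most once) vs", max(free), "(unrestricted)")
```

With fetch-at-most-once no video can exceed U views, which truncates the head of the distribution; varying R and U shows how the truncation moves.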

Effect of time
• Views increase over time
• Truncation gets sharper over time
• Many possible reasons: e.g. more push to most popular content

Impact of age
• Popularity does not vary strongly by age
• Most recent videos are slightly more popular
• Data is from the early days of YouTube: have things changed?

Can we predict future popularity?
• Consider current popularity (views) and age (time since upload)
• Does this correlate with future popularity?
• Table shows correlation coefficient (number of videos sampled)
• Strong correlation of early popularity with future popularity
• Already visible from day 2; using day 3 does not change much

Use these observations to cache
• Streaming video uses a lot of Internet bandwidth (up to 66%?)
• Could we cut bandwidth usage by running a cache?
• E.g. put a video cache for all of Warwick University
• How to fill the cache?
• Static: pick the most popular items once and for all
• Dynamic: initialize with most popular, then cache all new videos
• Unrealistic, but a point of comparison
• Hybrid: static + daily most popular
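
The static policy can be evaluated with a few lines of code: fill the cache from one period's most popular videos, then count how many later requests it serves. A minimal sketch; the function name and the request traces are made up for illustration.

```python
from collections import Counter


def static_cache_hit_ratio(training_requests, test_requests, cache_size):
    """Fill a cache with the most requested videos, then measure hits on later requests."""
    popular = Counter(training_requests).most_common(cache_size)
    cache = {video for video, _ in popular}
    hits = sum(1 for video in test_requests if video in cache)
    return hits / len(test_requests)


# Made-up request traces: video ids requested on two consecutive days.
day1 = ["a", "a", "a", "b", "b", "c", "d"]
day2 = ["a", "b", "e", "a", "c", "e"]
ratio = static_cache_hit_ratio(day1, day2, cache_size=2)
print(f"hit ratio: {ratio:.2f}")
```

The hybrid policy would extend the cache each day with that day's most popular videos before measuring hits.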

Content Copying
• Much content on YouTube is copied from original uploaders
• Re-uploaded by other users
• How would you detect when this happens?
• Here: picked 216 popular videos, asked volunteers to look
• Manually found 1224 copies of 184 of the original videos

Reflections on the paper
• A widely referenced paper from early in YouTube’s history
• 900+ citations in the literature
• Characterized many aspects of video viewing behaviour
• And attempted to explain many of these
• Many other plots in the paper
• Video on the Internet has changed a lot since 2007
• Changes to YouTube website structure
• Huge growth in mobile devices
• Videos with billions of views
• Do conclusions still hold? What other phenomena emerge?

Case Study 2: Microblogging
• “What is Twitter, a social network or a news media?”
• Kwak, Lee, Park, Moon (KAIST), in WWW conference 2010
• http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf
• Objectives:
• “To study the topological characteristics of Twitter and its power as a new medium of information sharing”
• “The first quantitative study on the entire Twittersphere and information diffusion on it”

One slide on Twitter
• Messaging service where messages are up to 140 characters
• Broadcast (by default) to all other users
• @userid addresses a particular user
• #hashtag to tag a message
• RT: re-tweet someone else’s message (sometimes with comment)
• Users “follow” another, and receive that user’s messages

Data Collection
• User Profiles
• Crawl name, location, timezone, number of tweets
• Begin at a popular user to crawl the “giant connected component”
• Twitter limits to 20,000 requests an hour
• Using 20 machines with different IP addresses, took 24 days
• Tweets
• Collected all tweets mentioning current “trending topics”
• Probed every 5 minutes
• Used Twitter search API to get up to 1500 tweets per query
• Collect text, author, timestamp
• Removed “spam tweets” from very new users

Follower/following analysis
[Figure: CCDF, the fraction of users with more than this number of followers/followings]
• Asymmetry between followers and following
• Some bumps can be explained with domain knowledge
• Following 20: Twitter suggests an initial set of 20 to follow
• Following 2000: used to be a limit of 2000, since removed
• Fits power law, with exponent 2.28 (quite skewed)

Followers and Tweets
• Group the users based on number of followers
• Compute the mean and median number of tweets for each count
• Group into larger bins and plot median per bin as dashed line
• Growth in activity is correlated with number of followers
• Correlation, not causation: which causes which?

Degree of separation
• Two users are “friends” if there is a mutual following relationship
• Study the friendship distance from random starting points
• 80% are within 4 steps (compare to “6 degrees of separation”)
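
Friendship distance as defined here can be measured with a breadth-first search over the mutual-follow graph. A minimal sketch, assuming the follow relation is held in memory as a dictionary; the function names and the tiny example graph are illustrative only.

```python
from collections import deque


def mutual_friends(follows):
    """follows: dict user -> set of users they follow. Keep only mutual relationships."""
    friends = {u: set() for u in follows}
    for u, followed in follows.items():
        for v in followed:
            if u in follows.get(v, ()):
                friends[u].add(v)
    return friends


def distances_from(friends, source):
    """Breadth-first search: number of friendship steps from source to each reachable user."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in friends[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


# Made-up graph: a and b follow each other, b and c follow each other, d is not mutual with anyone.
follows = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a"}}
dist = distances_from(mutual_friends(follows), "a")
print(dist)
```

Repeating the search from randomly sampled starting users and pooling the distances gives the distribution summarized on the slide.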

Proximity of users
• Are friends geographically close? How would you test this?
• Look at the average time-zone difference between friends
• As a function of number of friends

Importance of Twitter users
• Who are the most important/influential Twitter users?
• F: Count followers? Too crude?
• RT: Most retweeted?
• PR: Most critical in the follower-following graph?
• Use PageRank, originally defined to measure the importance of web pages
• Recursive definition: roughly, the PageRank of a node is the sum of the PageRanks of its followers, each shared out among the accounts that follower follows
• Can be computed efficiently (see later)
• Compare top-20 in each: they appear similar
• Compute rank-correlation between top-k
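
The recursive definition can be evaluated by repeated iteration. The following is a minimal sketch of the standard power-iteration form of PageRank with a damping factor; the damping value 0.85, the handling of users who follow nobody, and the example graph are assumptions for illustration, not details from the paper.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """follows: dict user -> list of users they follow. Returns dict user -> score."""
    nodes = set(follows)
    for followed in follows.values():
        nodes.update(followed)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            followed = follows.get(u, [])
            if followed:
                # A follower shares its score equally among the accounts it follows.
                share = damping * rank[u] / len(followed)
                for v in followed:
                    new_rank[v] += share
            else:
                # Users who follow nobody spread their score over everyone.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Made-up graph: a and b both follow c; c follows a.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```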

Comparison of importance measures
• Followers (F) and PageRank (PR) give the most similar rankings
• ReTweets (RT) are also correlated, but more weakly so

Spread of influence
[Figure caption: CCDF, fraction with more than this number]
• ReTweets can spread a message far and wide
• Figure shows retweets of messages about a plane crash
• Most trees are shallow (<3 hops)

Retweet time
• Distribution of delay from initial tweet to retweet
• Inter-hop delay through retweet trees
• What do these plots tell you?

Reflections on the paper
• See also “A few chirps about twitter”, Krishnamurthy, Gill, Arlitt
• From 2008, early in Twitter’s history
• Instructive to compare the approach and the findings
• Shows that there are many ways to slice and dice data from a relatively “simple” data source
• Did not even look much at content of tweets
• Widely cited (2400+ citations) as early work on Twitter

Case Study 3: Meme-tracking
• “Meme-tracking and the Dynamics of the News Cycle”
• Jure Leskovec, Lars Backstrom, Jon Kleinberg (Stanford, Cornell)
• International Conference on Knowledge Discovery and Data Mining (KDD), 2009
• snap.stanford.edu/class/cs224w-readings/leskovec09meme.pdf
• Objectives:
• Track short distinctive phrases that travel through the web
• Use this to study the “news cycle” in the news media

Data Collection and Preparation
• News and blog activity from August 1 to October 31 2008
• 90 million documents from 1.65 million sites
• Used Spinn3r API to collect: see spinn3r.com
• Extracted 112 million phrases in “quotes”
• Discard those with < 4 words or seen < 10 times (uninteresting)
• Discard those where > 25% of occurrences are from the same source (spam)
• Leaves 47 million occurrences of phrases
• Collect phrases into clusters, based on overlap
• Consider two phrases linked if they differ by at most 1 word
• Partition the induced “phrase graph” to isolate key phrases
• A long speech may have several key phrases in it
• Quite detailed process; see paper for details
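
One way to make "differ by at most 1 word" concrete is a word-level edit distance: two phrases are linked if one can be turned into the other by inserting, deleting or substituting a single word. This is an interpretation for illustration only; the paper's actual linking rule is more detailed. The function names are hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete word_a
                            curr[j - 1] + 1,      # insert word_b
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def linked(phrase_p, phrase_q):
    """Assumed linking rule: phrases differ by at most one word."""
    return word_edit_distance(phrase_p.split(), phrase_q.split()) <= 1


print(linked("lipstick on a pig", "lipstick on the pig"))
print(linked("lipstick on a pig", "you can put lipstick on a pig"))
```

Treating each phrase as a node and each `linked` pair as an edge yields a phrase graph of the kind that is then partitioned into clusters.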

Phrase distribution
• For each volume, plot # of phrases with at least that volume
• For all phrases, clusters of phrases, and phrases about “lipstick on a pig” (largest phrase cluster)

Most important threads
• A thread is all articles containing a phrase from a cluster
• Plot is automatically generated and labeled to show volume

Formulating a model
• Advanced data analytics: propose a new model to explain data
• Try to capture major effects, and neglect minor points
• Imitation: sources imitate/copy each other
• Recency: news cycle dominated by recent events
• Model: simulate discrete time steps in which news sources report threads
• New thread produced at each step
• At time $t$, each source picks thread $j$ with probability proportional to $f(n_j)\,d(t - t_j)$
• $n_j$: number of sources reporting on thread $j$ [“Imitation”]
• $d(t - t_j)$: decay factor based on the age of thread $j$ [“Recency”]
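
The model is simple enough to simulate in a few lines. The sketch below assumes a power-law imitation term f(n) = (n + 1)^gamma (the +1 lets a brand-new thread be chosen) and an exponential recency term d(age) = exp(-decay * age); these forms, the parameter values and all names are illustrative assumptions rather than the paper's exact choices.

```python
import math
import random


def simulate_news_cycle(num_steps, num_sources, gamma=1.0, decay=0.2, seed=0):
    """Return (final volume per thread, list of thread choices made at each step)."""
    rng = random.Random(seed)
    start_time = []  # t_j: step at which thread j appeared
    volume = []      # n_j: reports on thread j so far
    history = []
    for t in range(num_steps):
        # A new thread is produced at each step.
        start_time.append(t)
        volume.append(0)
        # Weight of thread j is f(n_j) * d(t - t_j); choices() normalizes the weights.
        weights = [
            (volume[j] + 1) ** gamma * math.exp(-decay * (t - start_time[j]))
            for j in range(len(volume))
        ]
        picks = rng.choices(range(len(volume)), weights=weights, k=num_sources)
        for j in picks:
            volume[j] += 1
        history.append(picks)
    return volume, history


volume, history = simulate_news_cycle(num_steps=30, num_sources=10)
print(volume)
```

Setting `decay=0` gives an imitation-only variant, and `gamma=0` a recency-only variant, which makes it easy to see what each ingredient contributes.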

Validating the Model
[Figure panels: imitation-only, recency-only]
• Simulation: pick f() as power-law and d() as exponential decay
• Generates synthetic data that looks similar to real
• Can we more rigorously validate the model?

Thread popularity over time
• Popularity of a thread spikes at the median time of reports
• Plot median related volume of 1000 threads around peak time
• Quite symmetric, but faster decay than buildup
• 8 hours away from the peak, modeled by an exponential function
• Close to the peak, modeled by a sharper function, log(1/|t|)

Reflections on the paper
• A very innovative approach to the question
• Created from scratch a way to think about “memes”
• Proposed new models, and gave some evaluation of them
• Tackled a timely question and used compelling examples
• Plots and figures illustrate the key points
• Widely cited (600+ citations)
• But: models not robustly evaluated

Case Study 4: Link prediction
• “Link Prediction by De-anonymization: How we won the Kaggle Social Network Challenge”
• Narayanan (Texas), Shi (Berkeley), Rubinstein (Microsoft)
• International Joint Conference on Neural Networks, 2011
• http://arxiv.org/abs/1102.4374
• Objectives:
• Correctly predict whether two users in a network would form a link
• Use additional background information to improve the results

The Competition
• Kaggle hosts competitions for data analytics
• Hosted the 2011 IJCNN Social Network Challenge in late 2010
• Provided a graph drawn from a social network
• Nodes correspond to users in the network
• (Directed) edges indicate a following relationship
• Evaluation: determine whether a set of test edges truly occur
• It was later disclosed that the graph came from Flickr
• Edges are (directed) “friendship” relations between users

Link Prediction
• Goal of “link prediction” is to determine which new links will form, given the current state of the graph
• Many factors can be taken into account:
• Properties of the nodes
• Existing number of links
• Common neighbours between a pair
• Graph distance between the pair
• This work used an additional factor:
• Try to match nodes in the data set to their own data collection

Data Collection
• Competition data: Kaggle
• 1.1M nodes, 7.2M edges provided as main data
• 8960 “test edges”: 50% true edges (removed from main data)
• 20% of test set held back by Kaggle to evaluate the results
• Competitors’ data: Flickr
• Crawled Flickr social graph (used Python + Curl library)
• 2M nodes and all outgoing edges crawled
• Total of 9.1M nodes, with 163M edges (much bigger data set)
• Evaluation: Area Under the Curve (AUC)
• Values are True/False, predicted as Positive/Negative

Evaluating Results
• False positive rate: FP/(FP + TN)
• What fraction of False values are reported as positive?
• True positive rate (aka sensitivity, recall): TP/(TP + FN)
• What fraction of True values are reported as positive?
• Precision: TP/(TP + FP)
• What fraction of those reported positive are correct?
• Accuracy: (TP + TN)/(TP + TN + FN + FP)
• F1 Measure: 2TP/(2TP + FP + FN)
• Harmonic mean of precision and recall
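
These definitions translate directly into code. A minimal sketch computing all five measures from the four confusion-matrix counts; the function name and the counts used are made up for illustration.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures above from confusion-matrix counts."""
    return {
        "false_positive_rate": fp / (fp + tn),
        "true_positive_rate": tp / (tp + fn),  # also called recall or sensitivity
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up counts for 100 predictions.
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Checking that `f1` equals `2 * precision * recall / (precision + recall)` confirms the harmonic-mean interpretation.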

ROC and AUC
• Assume each prediction has a confidence p between 0 and 1
• Pick a threshold r: round confidences below r to 0, and the rest to 1
• For each choice of r, we get a true positive rate and a false positive rate
• Different choices of r give a different tradeoff
• Receiver Operating Characteristic (ROC) curve: plots true positive rate against false positive rate as r varies
• Closer to top-left is better
• Area Under Curve (AUC) measures overall quality
• Compute AUC by stepping through sorted confidence values
• Random guessing gives the diagonal line
• AUC is 0.5
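
Stepping through the sorted confidence values can be sketched as below: each distinct confidence is treated as a threshold, and the area is accumulated with trapezoids so that tied confidences are handled sensibly. The function name and test data are illustrative; this is a plain-Python sketch, not the competition's scoring code.

```python
def auc(labels, scores):
    """Area under the ROC curve. labels: True/False values; scores: confidence of True."""
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_pos = sum(1 for _, label in pairs if label)
    num_neg = len(pairs) - num_pos
    tp = fp = 0
    prev_tp = prev_fp = 0
    area = 0.0
    i = 0
    while i < len(pairs):
        # Lower the threshold past all predictions sharing this confidence value.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        # Trapezoid between the previous and current ROC points (in raw counts).
        area += (fp - prev_fp) * (tp + prev_tp) / 2.0
        prev_tp, prev_fp = tp, fp
        i = j
    return area / (num_pos * num_neg)


print(auc([True, False, True, False], [0.9, 0.8, 0.7, 0.1]))
```

Equivalently, AUC is the probability that a randomly chosen true example receives a higher confidence than a randomly chosen false one, with ties counted as one half.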

Degree distribution
• Each node has an in-degree and an out-degree
• Skewed distribution: few nodes have high in/out-degree
• Can try to use these as “landmarks”

Seed identification
• Try to match a few nodes between the two graphs
• Look at nodes with high in-degree (pointed to by many)
• These are likely to be present in both graphs
• Because of the crawling process
• Pick the n (20) highest-degree nodes from Kaggle (K) and Flickr (F)
• Try to match them up to get a “seed” matching
• For a pair of nodes v, w in K (likewise in F), compute their “cosine similarity”:
• $\mathrm{sim}(v, w) = \dfrac{|N(v) \cap N(w)|}{\sqrt{|N(v)| \cdot |N(w)|}}$, where $N(v)$ is the set of neighbours of $v$
• Find the best matching of nodes in K and F based on cosine similarities
• Initially: manually
• Later: optimization problem (see OR and optimization)
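
The cosine similarity between two nodes depends only on their neighbour sets, so it is short to compute. A minimal sketch with neighbour sets given as Python sets; the function name and the example sets are illustrative.

```python
import math


def cosine_similarity(neighbours_v, neighbours_w):
    """Common neighbours divided by the geometric mean of the two neighbourhood sizes."""
    if not neighbours_v or not neighbours_w:
        return 0.0
    common = len(neighbours_v & neighbours_w)
    return common / math.sqrt(len(neighbours_v) * len(neighbours_w))


# Made-up neighbour sets: two of four neighbours shared.
print(cosine_similarity({1, 2, 3, 4}, {3, 4, 5, 6}))
```

Computing this for every pair among the top-degree nodes within each graph gives two small similarity matrices, which can then be compared to find the seed matching.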

Propagation
• Now have matched n nodes between the Kaggle and Flickr graphs
• Maintain a matching, and try to extend it based on neighbours
• Find pairs of nodes in Kaggle and Flickr whose similarity is high
• Extend the matching. Iterate.
• Some heuristics to accept a new pair into the matching:
• Must be at least 4 mapped common neighbours
• Cosine-similarity score must be at least 0.5
• Difference in similarity scores between best and second best must be at least 0.2
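
The three acceptance heuristics can be expressed as a single check on the candidate matches for one Kaggle node. A minimal sketch, assuming candidates arrive as (Flickr node, similarity score, number of mapped common neighbours) tuples; this data layout, the function name and the treatment of a lone candidate are assumptions for illustration.

```python
def accept_match(candidates, min_common=4, min_score=0.5, min_gap=0.2):
    """Return the best candidate Flickr node if it passes all three heuristics, else None."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    if not ranked:
        return None
    best_node, best_score, best_common = ranked[0]
    # Assumption: with a single candidate, the runner-up score is taken as 0.
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if (best_common >= min_common
            and best_score >= min_score
            and best_score - second_score >= min_gap):
        return best_node
    return None


# Made-up candidates: clear winner, so the match is accepted.
print(accept_match([("f1", 0.9, 5), ("f2", 0.6, 6)]))
```

Requiring a gap to the runner-up trades coverage for precision: ambiguous nodes are left unmatched rather than risk propagating an error.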

Results
• Using “ground truth” information:
• After 120,000 mappings in first stage, 99.3% correct
• After second stage, had mappings for 14K of the 17.6K nodes in the test set
• Overall accuracy for matched nodes 97.8%
• A coverage of 57% of edges: still need to give an answer for the rest
• For test edges, accuracy was 95%
• Use inferred information to predict links for more of the test set
• Look at all possible candidates for node pair in Flickr
• Use these to vote on whether the edge is present or not
• Accept if unanimous vote
• Covers a further 19% of test edges

Machine Learning
• Leaves 24% of test edges without a mapping to Flickr
• Apply Machine Learning (the original goal of the challenge)
• Create a number of “features” for each edge
• In-degree and out-degree of node
• Whether reverse edge exists
• Measures of local graph (number of common neighbours etc.)
• Train a “classifier”: see later lectures on classification
• AUC for the classifier approach is ~0.9
• Total AUC for the whole approach on test data is 0.981
• Excellent accuracy for deanonymized nodes, less for the rest

Reflections on the paper
• The score of 0.981 AUC was enough to win the competition
• Second best was 0.969: www.kaggle.com/c/socialNetwork/leaderboard
• The researchers contacted the organizers to reveal their method
• Were told that this was within the rules
• Read the message board for the competition to see other opinions: www.kaggle.com/c/socialNetwork/forums
• Lesson: when understanding data, think beyond what you have
• Are there other data sets that can help understand it better?
• Can you learn properties of one data set and transfer to another?
• Can you link two data sets to learn more about the first?
• Lesson: removing information does not “anonymize” data
