ACM IMC 2007-10-24. I Tube, You Tube, Everybody Tubes… Analyzing the World’s Largest User Generated Content Video System. Meeyoung Cha (Intern at Telefonica Research / KAIST).
Why Study UGC? “Bite-size bits for high-speed munching” [Wired mag., Mar 2007] • Plethora of YouTube clones • UGC is very different. How different?
UGC vs. Non-UGC • Massive production scale: YouTube takes 15 days to produce the equivalent of 120 years' worth of movies in IMDb • Extreme publishers: 1,000 uploads over a few years vs. 100 movies over 50 years • Short video length: 30 sec–5 min vs. 100-min movies in LoveFilm • The rest: consumption patterns
Goals and Data • Goals: popularity distribution, popularity evolution, P2P scalable distribution, content duplication • Data: crawled YouTube and other UGC systems; metadata: video ID, length, views; 1.6M Entertainment and 250K Science videos
Part 1: Popularity Distribution • Static popularity characteristics • Underlying mechanism
Pareto Principle • The 10% most popular videos account for 80% of total views • Other online VoD systems show a smaller skew [Figure: fraction of aggregate views vs. normalized video ranking]
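The skew on this slide can be checked on any list of per-video view counts. The following is a minimal sketch, assuming the view counts are available as a plain list; the function name `top_share` and the sample numbers are illustrative and are not taken from the crawled dataset.

```python
def top_share(views, top_fraction=0.10):
    """Fraction of total views captured by the most-viewed `top_fraction` of videos."""
    if not views:
        raise ValueError("views must be non-empty")
    ranked = sorted(views, reverse=True)
    # Always keep at least one video in the "top" group.
    k = max(1, round(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Hypothetical view counts for ten videos (not real YouTube data).
sample_views = [800, 50, 40, 30, 25, 20, 15, 10, 6, 4]
print(top_share(sample_views))  # the top 10% is one video holding 800 of 1000 views
```

Plotting `top_share` for every value of `top_fraction` between 0 and 1 would give the kind of curve described by the slide's axes.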
Dominant Power-Law Behavior • Richer-get-richer principle: if a video has K views, then users will watch the video at rate K • Other examples: word frequency, citations of papers, scale of earthquakes, web hits • Power law: y = x^a [Figure: frequency (log) vs. city population (log)]
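The richer-get-richer idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model used in the talk: it follows a Yule-Simon style process in which each new view either goes to a brand-new video (with a hypothetical probability `new_video_prob`) or to an existing video chosen in proportion to the views it already has.

```python
import random


def simulate_rich_get_richer(total_views, new_video_prob=0.1, seed=0):
    """Return per-video view counts after `total_views` views have been handed out."""
    rng = random.Random(seed)
    views = [1]      # views[i] = current view count of video i
    view_log = [0]   # one entry per view; a uniform draw from it is a draw proportional to views
    for _ in range(total_views - 1):
        if rng.random() < new_video_prob:
            # A new video appears and receives its first view.
            views.append(1)
            view_log.append(len(views) - 1)
        else:
            # An existing video is picked with probability proportional to its views.
            vid = rng.choice(view_log)
            views[vid] += 1
            view_log.append(vid)
    return views


counts = simulate_rich_get_richer(10_000)
print(len(counts), sorted(counts, reverse=True)[:5])
```

Sorting `counts` in descending order and plotting them on log-log axes should give a roughly straight line, which is the signature the slide refers to.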
UGC Video Distribution • Straight-line waist, truncated at both ends
Focusing on Popular Videos • Why do popular videos deviate from the power law? • Fetch-at-most-once [SOSP 2003]: users fetch immutable objects only once (cf. visiting popular web sites many times)
Simulation on Various Parameters • Number of videos (V), users (U), avg. requests per user (R) • Fetch-at-most-once: the tail is more truncated for larger R and smaller V [Figure: complementary cumulative number of videos (log) vs. views (log), compared with power-law behavior; labels: U=1000, R=10; R=50, R=20, R=10, V=100]
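The effect of fetch-at-most-once can be reproduced in a few lines. The sketch below is an assumption-laden illustration rather than the simulator behind the slide: every user draws requests from a Zipf-like popularity distribution (the exponent `zipf_exponent` is a hypothetical parameter), and under fetch-at-most-once a user simply redraws whenever the chosen video was already watched.

```python
import random
from itertools import accumulate


def simulate_views(num_videos, num_users, requests_per_user,
                   fetch_at_most_once, zipf_exponent=1.0, seed=0):
    """Return per-video view counts, with video 0 being the most popular."""
    rng = random.Random(seed)
    # Cumulative Zipf-like weights: weight of rank r is 1 / r**zipf_exponent.
    cum_weights = list(accumulate(1.0 / rank ** zipf_exponent
                                  for rank in range(1, num_videos + 1)))
    video_ids = range(num_videos)
    views = [0] * num_videos
    per_user = requests_per_user
    if fetch_at_most_once:
        # A user cannot watch more distinct videos than exist.
        per_user = min(requests_per_user, num_videos)
    for _ in range(num_users):
        seen = set()
        fetched = 0
        while fetched < per_user:
            vid = rng.choices(video_ids, cum_weights=cum_weights)[0]
            if fetch_at_most_once and vid in seen:
                continue  # already watched by this user: draw again
            seen.add(vid)
            views[vid] += 1
            fetched += 1
    return views


# V=100, U=1000, R=10, values that appear among the slide's figure labels.
once = simulate_views(100, 1000, 10, fetch_at_most_once=True)
repeat = simulate_views(100, 1000, 10, fetch_at_most_once=False)
print(max(once), max(repeat))
```

With fetch-at-most-once, no video can collect more views than there are users, which caps the popular end of the curve; raising R or lowering V makes more videos hit that cap.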
Why the Unpopular Tail Falls Off • Natural shape is curved • Sampling bias or pre-filters: publishers tend to upload interesting videos • Information filtering or post-filters: search results or suggestions favor popular items
Impact of Post-Filters • Videos exposed longer to the filtering effect appear more truncated [Figure: x-axis is video rank]
Is it Naturally Curved? • Matlab curve fitting for Science videos • Zipf is scale-free, while exponential is scaled: the underlying mechanism is Zipf, and the truncation is due to bottlenecks [Figure: Science videos fitted with Zipf, Zipf + exp cutoff, exponential, and log-normal curves]
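The slide's fits were done in Matlab and cover four candidate curves. As a much simpler stand-in, the Zipf candidate alone can be fitted by ordinary least squares in log-log space, since a Zipf curve `views = C * rank**(-alpha)` becomes a straight line there. The function name `fit_zipf` and the synthetic data are assumptions made for illustration.

```python
import math


def fit_zipf(views_by_rank):
    """Least-squares fit of log(views) = log(C) - alpha * log(rank).

    `views_by_rank[0]` is the view count of the top-ranked video.
    Returns (alpha, log_c). Videos with zero views are skipped.
    """
    points = [(math.log(rank), math.log(v))
              for rank, v in enumerate(views_by_rank, start=1) if v > 0]
    if len(points) < 2:
        raise ValueError("need at least two videos with positive views")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return -slope, mean_y - slope * mean_x


# Synthetic, noise-free Zipf data with alpha = 0.8 and C = 1000 (not real measurements).
synthetic = [1000 * rank ** -0.8 for rank in range(1, 201)]
alpha, log_c = fit_zipf(synthetic)
print(round(alpha, 3), round(math.exp(log_c), 1))
```

On real data, systematic residuals at both ends of the ranking would be the sign of the truncation that the exponential-cutoff and log-normal candidates try to capture.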
Implication of Our Findings • “Latent demand for products that is suppressed by bottlenecks in the system” [Chris Anderson, The Long Tail] • Entertainment: 40% additional views! • How? Personalized recommendation, enriched metadata [Figure: views vs. rankings for Entertainment videos; labels include “Abundant videos”]
Part 2: Popularity Evolution • Relationship between popularity and age
Popularity Evolution • So far, we focused on static popularity • Now focus on popularity dynamics • How are requests on any given day distributed across video age? • 6-day daily trace of Science videos • Step 1: Group videos requested at least once by age • Step 2: Count request volume per age group
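The two steps can be sketched as a single pass over one day's records. The record layout `(video_id, age_days, requests)` and the `bin_days` parameter are assumptions chosen for illustration; the actual trace format may differ.

```python
from collections import defaultdict


def request_volume_by_age(daily_records, bin_days=1):
    """Sum one day's requests per video-age group.

    daily_records: iterable of (video_id, age_days, requests) for a single day.
    bin_days: width of each age group in days.
    Returns {age_group_index: total_requests}, ordered by age group.
    """
    volume = defaultdict(int)
    for _video_id, age_days, requests in daily_records:
        if requests < 1:
            continue  # Step 1: keep only videos requested at least once
        volume[age_days // bin_days] += requests  # Step 2: count volume per age group
    return dict(sorted(volume.items()))


# Hypothetical records for one day (not taken from the Science trace).
records = [("a", 0, 5), ("b", 0, 3), ("c", 10, 0), ("d", 400, 7), ("e", 401, 1)]
print(request_volume_by_age(records))               # {0: 8, 400: 7, 401: 1}
print(request_volume_by_age(records, bin_days=30))  # {0: 8, 13: 8}
```

Running this once per day of the trace and normalizing each result by that day's total would give the request share per age group.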
Request Volume Across Age 1. Viewers mildly more interested in new videos
Request Volume Across Age 2. User preference relatively insensitive to age (80% of requests go to old videos)
Request Volume Across Age 3. Daily top hits mostly come from new videos
Request Volume Across Age 4. Some old videos get significant requests
Part 3: P2P Scalable Distribution • Potential savings from P2P (against client-server model) • Optimistic upper bound
Peer-assisted VoD • 50-200 Gb/s estimated serving capacity • Bandwidth, hardware, power consumption • Stream from VoD servers or from peers • Varying user lifetime • P2P when possible [Figure: video server streaming movie1 and movie2 to users A, B, and C]
Number of Beneficiary Videos • P2P viable when at least 2 online users share a video • Very few videos benefit, but they benefit a lot [Figure: estimated number of online users per video at any moment]
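The slide does not say how the number of online users per video was estimated. One back-of-the-envelope approach, assumed here rather than taken from the talk, is Little's law: average concurrent viewers equal the request arrival rate times the time each viewer stays. The sketch assumes requests are spread evenly over the day and that every viewer stays for exactly the video length.

```python
SECONDS_PER_DAY = 86_400


def estimated_concurrent_viewers(daily_requests, video_length_sec):
    """Little's law estimate: arrival rate (requests per second) times viewing time."""
    return daily_requests * video_length_sec / SECONDS_PER_DAY


def p2p_viable(daily_requests, video_length_sec, min_peers=2):
    """Apply the slide's criterion: at least `min_peers` users online for the same video."""
    return estimated_concurrent_viewers(daily_requests, video_length_sec) >= min_peers


# Hypothetical videos, each 3 minutes long (not real measurements).
print(p2p_viable(1000, 180))  # about 2.08 concurrent viewers
print(p2p_viable(100, 180))   # about 0.21 concurrent viewers
```

Because video lengths are short, only videos with a high daily request rate clear the two-viewer threshold, which is consistent with the observation that very few videos benefit.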
Server Workload Savings in P2P • Potential for significant savings, due to skewed and temporal request patterns [Figure label: P2P-assisted]
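An optimistic upper bound on the savings can be sketched as follows. The model is an assumption made for illustration, not the one evaluated in the talk: under client-server the server sends one stream per concurrent viewer, while under P2P assistance it sends at most one stream per video and peers relay it to everyone else watching the same video.

```python
def server_savings(concurrent_viewers_per_video):
    """Fraction of server streams saved by P2P assistance under the optimistic model.

    concurrent_viewers_per_video: average number of simultaneous viewers of each video.
    """
    client_server_load = sum(concurrent_viewers_per_video)
    if client_server_load == 0:
        return 0.0
    # With peers relaying, the server needs at most one stream per video.
    p2p_load = sum(min(n, 1) for n in concurrent_viewers_per_video)
    return 1 - p2p_load / client_server_load


# Hypothetical concurrency values for four videos.
print(server_savings([5, 2, 0.5, 0.5]))  # 1 - 3/8 = 0.625
```

The bound ignores upload capacity limits, peer churn, and partial viewing, so real savings would be lower.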
Part 4: Content Duplication • Level of duplication • Birth of duplicates
Content Duplication • Aliases: identical or similar copies of the same content • Aliases dilute the popularity of a single event: views are distributed across multiple copies • Difficulty in recommendation & ranking systems • Test with 51 volunteers: find aliases using keyword search • Identified 1,224 aliases for 184 original videos
The Level of Popularity Dilution • Popularity diluted by up to two orders of magnitude
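One way to quantify dilution, assumed here since the slide does not define the exact metric, is the ratio of the combined views of all aliases to the views of the original video. A ratio of 100 corresponds to two orders of magnitude.

```python
import math


def dilution_ratio(original_views, alias_views):
    """Combined views of all aliases divided by the views of the original video."""
    if original_views <= 0:
        raise ValueError("original_views must be positive")
    return sum(alias_views) / original_views


# Hypothetical example: one original and three aliases (not from the volunteer study).
ratio = dilution_ratio(1_000, [50_000, 30_000, 20_000])
print(ratio, math.log10(ratio))  # 100.0 2.0
```

A ranking system that counts only the original's views would underestimate this event's popularity by that factor.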
How Late Do Aliases Appear? • Significant aliases appear within one week
Contribution • The first detailed study on UGC video popularity: power-law waist, truncation at popular and non-popular videos • Analyzed popularity dynamics using a daily trace: relationship between popularity and age • Explored the potential for P2P distribution • Showed the difficulty in video ranking due to aliases
Questions? • Dataset available at http://an.kaist.ac.kr/traces/IMC2007.html • Meeyoung Cha, meeyoung.cha@gmail.com