
Lada Adamic, HP Labs, Palo Alto, CA

Information dynamics in the networked world. Lada Adamic, HP Labs, Palo Alto, CA. Talk outline. Information flow through blogs. Information flow through email. Search through email networks. Search within the enterprise. Search in an online community. Blog use:





Presentation Transcript


  1. Information dynamics in the networked world. Lada Adamic, HP Labs, Palo Alto, CA

  2. Talk outline: Information flow through blogs; Information flow through email; Search through email networks; Search within the enterprise; Search in an online community.

  3. Blog use: record real-world and virtual experiences; note and discuss things “seen” on the net. Blog structure: blog-to-blog linking. Use + structure: great for tracking “memes” (catchy ideas). Implicit Structure and Dynamics of BlogSpace: Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose.

  4. Approaches and uses of blog analysis. Patterns of information flow: How does the popularity of a topic evolve over time? Who is getting information from whom? Ranking algorithms that take advantage of transmission patterns.

  5. Tracking popularity over time: the Slashdot Effect and the BoingBoing Effect. Blogdex, BlogPulse, etc. track the most popular links/phrases of the day. [Figure: popularity vs. time.]

  6. Different kinds of information have different popularity profiles. [Figure: four panels (major-news site editorial content, “back of the paper”; products, etc.; Slashdot postings; front-page news), each showing the % of hits received on each day since first appearance.]

  7. Micro example: Giant Microbes

  8. Microscale dynamics: what do we need to track specific info ‘epidemics’? Timings; underlying network. [Diagram: blogs b1, b2, b3, with time of infection running from t0 to t1.]

  9. Microscale dynamics, challenges: root may be unknown; multiple possible paths; uncrawled space, alternate media (email, voice); no links. [Diagram: blogs b1, b2, b3, … bn with uncertain (“?”) paths, and time of infection running from t0 to t1.]

  10. Microscale dynamics: who is getting info from whom. Explicit blog-to-blog links (easy); via links are even better. Implicit/inferred transfer (harder): use an ML algorithm for the link inference problem (Support Vector Machine (SVM), logistic regression). What we can use: full text; blogs in common; links in common; history of infection.
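
The four inputs listed on slide 10 could be turned into numeric features for a classifier roughly as follows. This is only a sketch: the slide does not say how the features were computed, so the Jaccard overlap, the `pair_features` name, and the blog record layout (`words`, `blogroll`, `urls`) are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(blog_a, blog_b):
    """Features for the question "did blog_b get its information from blog_a?".

    Each blog is a dict (hypothetical layout):
      "words":    set of words in the blog's text
      "blogroll": set of blogs it links to
      "urls":     dict mapping a cited URL to the day it was first cited
    """
    shared = blog_a["urls"].keys() & blog_b["urls"].keys()
    # History of infection: how often blog_a cited a shared URL first.
    a_first = sum(1 for u in shared if blog_a["urls"][u] < blog_b["urls"][u])
    return {
        "text_sim": jaccard(blog_a["words"], blog_b["words"]),
        "blogs_in_common": jaccard(blog_a["blogroll"], blog_b["blogroll"]),
        "links_in_common": jaccard(blog_a["urls"], blog_b["urls"]),
        "a_first_fraction": a_first / len(shared) if shared else 0.0,
    }


# Toy records, invented for this example.
blog_a = {"words": {"giant", "microbes", "plush"},
          "blogroll": {"boingboing", "slashdot"},
          "urls": {"u1": 1, "u2": 5}}
blog_b = {"words": {"giant", "microbes", "toy"},
          "blogroll": {"boingboing"},
          "urls": {"u1": 3, "u3": 2}}
features = pair_features(blog_a, blog_b)
```

A feature vector like this would then be fed to an SVM or logistic regression trained on pairs with known explicit links.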

  11. Visualization: Zoomgraph tool, using GraphViz (by AT&T) layouts. Simple algorithm: if a single, explicit link exists, draw it; otherwise use the ML algorithm: pick the most likely explicit link, or pick the most likely possible link. The tool lets you zoom around the space, control the threshold, link types, etc. http://www-idl.hpl.hp.com/blogstuff

  12. Giant Microbes epidemic visualization. [Figure legend: blog; explicit link; via link; inferred link.]

  13. iRank: find early sources of good information using inferred information paths or timing. [Diagram: blogs b1, b2, b3, b4, b5, … bn, with one labeled “True source” and one labeled “Popular site”.]

  14. iRank Algorithm • Draw a weighted edge for all pairs of blogs that cite the same URL • higher weight for mentions closer together • run PageRank • control for ‘spam’. [Diagram: time of infection running from t0 to t1.]
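
The first three steps on slide 14 can be sketched in a few lines. The slide does not give the weighting function or the edge direction, so both are assumptions here: the weight is `1 / (1 + time gap)`, and each edge points from the later citer to the earlier one so that rank flows toward early sources. The spam control step is not implemented, and `build_edges` and `pagerank` are illustrative names rather than the authors' code.

```python
from collections import defaultdict
from itertools import combinations


def build_edges(citations):
    """citations: dict mapping a URL to a list of (blog, time) mentions.

    Returns weights[later_blog][earlier_blog]; mentions that are closer
    together in time get a higher weight (assumed form: 1 / (1 + gap)).
    """
    weights = defaultdict(lambda: defaultdict(float))
    for mentions in citations.values():
        ordered = sorted(mentions, key=lambda m: m[1])
        for (early, t1), (late, t2) in combinations(ordered, 2):
            if early == late or t1 == t2:
                continue  # same blog, or no way to tell who came first
            weights[late][early] += 1.0 / (1.0 + (t2 - t1))
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the edge weights."""
    nodes = set(weights)
    for targets in weights.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = weights.get(v, {})
            total = sum(out.values())
            if total == 0:
                # No outgoing edges: spread this node's rank evenly.
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u, w in out.items():
                    new[u] += damping * rank[v] * w / total
        rank = new
    return rank


# Toy data, invented for this example: one URL cited by four blogs.
citations = {"http://example.org/story": [
    ("early", 0), ("popular", 2), ("late1", 3), ("late2", 4)]}
ranks = pagerank(build_edges(citations))
```

In this toy case the blog that cited the URL first ends up with the highest rank, which is the behaviour slide 13 asks for.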

  15. Do Bloggers Kill Kittens? 02:00 AM Friday Mar. 05, 2004 PST: Wired publishes "Warning: Blogs Can Be Infectious." 7:25 AM Friday Mar. 05, 2004 PST: Slashdot posts "Bloggers' Plagiarism Scientifically Proven". 9:55 AM Friday Mar. 05, 2004 PST: Metafilter announces "A good amount of bloggers are outright thieves."

  16. Information flow in social groups. Fang Wu, Bernardo Huberman, Lada Adamic, Joshua Tyler

  17. Spread of disease is affected by the underlying network. [Diagram: social network with nodes labeled mike, mom, college friend, and three co-workers.]

  18. Spread of computer viruses is affected by the underlying network. [Diagram: social network with nodes labeled mike, mom, college friend, and three co-workers.]

  19. Difference between information flow and disease/virus spread: viruses (computer and otherwise) are shared indiscriminately (involuntarily); information is passed selectively from one host to another based on knowledge of the recipient’s interests.

  20. Spread of information is affected by its content, potential recipients, and network topology. [Diagram: social network with nodes labeled mike, mom, college friend, and three co-workers.]

  21. Homophily: individuals with like interests associate with one another. [Figure: personal homepages at Stanford; axis: distance between personal homepages.]

  22. The Model: decay in transmission probability as a function of the distance m between potential target and originating node: $T(m) = (m+1)^{-\beta}\,T$. Power-law implies slowest decay. [Diagram: nodes at distances m=0, m=1, m=2 from the originating node.]
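
A small worked example of the decay model on slide 22, assuming the garbled exponent is a decay parameter (called `beta` here) and that `T` is the transmission probability at the originating node (called `base_t` here). The function name and the sample values are illustrative, not taken from the talk.

```python
def transmission_probability(m, base_t, beta):
    """T(m) = (m + 1) ** (-beta) * T for a target at distance m >= 0."""
    if m < 0:
        raise ValueError("distance m must be non-negative")
    return (m + 1) ** (-beta) * base_t


# With beta = 1 the probability halves at distance 1 and keeps falling;
# with beta = 0 there is no decay, which is the plain epidemic case.
decayed = [transmission_probability(m, 0.5, 1.0) for m in range(4)]
undecayed = [transmission_probability(m, 0.5, 0.0) for m in range(4)]
```

Larger values of `beta` confine an item more tightly to the neighbourhood of its source, which is why the model yields a finite epidemic threshold.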

  23. Virus, information transmission on a scale-free network. [Figure: degree distribution P(k) vs. outdegree k of all senders of email passing through the HP email server.]

  24. Epidemics on scale-free graphs: Wu et al. (2004), Newman (2002), Pastor-Satorras & Vespignani (2001). $10^6$ nodes, epidemic if 1% ($10^4$) infected. [Figure: critical threshold (0 to 1) vs. $\alpha$ (1 to 4), with curves for $\kappa = \infty, \beta = 0$; $\kappa = 100, \beta = 0$; and $\kappa = 100, \beta = 1$.]

  25. Study of the spread of URLs and attachments: 40 participants (30 within HPL, 10 elsewhere in HP & other orgs); 6370 URLs and 3401 attachments cryptographically hashed. Question: How many recipients in our sample did each item reach? Caveats: messages are deleted (still, the median number of messages > 2000); non-uniform sample.

  26. Only forwarded messages are counted. [Diagram labels: forwarded message; forwarded URLs.]

  27. Results: average = 1.1 for attachments, and 1.2 for URLs. [Figure: log-log plot of the number of items with so many recipients vs. number of recipients, with fits $x^{-4.1}$ for email attachments and $x^{-3.6}$ for URLs; annotated items: “short term expense control” and “ads at the bottom of hotmail & yahoo messages”.]

  28. Simulate transmission on email log: each message has a probability p of transmitting information from an infected individual to the recipient. Log excerpt (timestamp and two node IDs; I = internal node, E = external node):
      02/19/2003 15:45:33  I-1   I-2
      02/19/2003 15:45:33  I-1   I-3
      02/19/2003 15:45:40  E-1   I-4
      02/19/2003 15:45:52  I-5   E-2
      02/19/2003 15:45:55  E-3   I-6
      02/19/2003 15:45:58  I-7   I-8
      02/19/2003 15:46:00  E-4   I-9
      02/19/2003 15:46:05  I-10  I-11
      02/19/2003 15:46:10  I-12  I-13
      02/19/2003 15:46:10  I-12  I-14
      02/19/2003 15:46:10  I-12  I-15
      02/19/2003 15:46:14  I-16  E-5
      . . .

  29. Simulation of information transmission on the actual HP Labs email graph: an individual is infected if they receive a particular piece of information; individuals remain infected for 24 hours; start by infecting one individual at random; every time an infected individual sends an email they have a probability p of infecting the recipient; track the epidemic over the course of a week (most run their course in 1-2 days).
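
The simulation rules on slides 28 and 29 can be replayed over a log in a short loop. This sketch makes several assumptions that the slides do not state: the two node IDs on each log line are read as sender then recipient, the seed is treated as infected at the time of the first log entry, and a person is infected at most once. `parse_line` and `simulate` are illustrative names.

```python
import random
from datetime import datetime, timedelta


def parse_line(line):
    """Parse 'MM/DD/YYYY HH:MM:SS sender recipient' (column order assumed)."""
    date, clock, sender, recipient = line.split()
    when = datetime.strptime(f"{date} {clock}", "%m/%d/%Y %H:%M:%S")
    return when, sender, recipient


def simulate(log, seed_person, p, rng, infectious=timedelta(hours=24)):
    """Replay a time-sorted log and return the set of people ever infected."""
    if not log:
        return {seed_person}
    infected_at = {seed_person: log[0][0]}  # seed infected at the log start
    for when, sender, recipient in log:
        start = infected_at.get(sender)
        if start is None or when - start > infectious:
            continue  # sender never infected, or no longer infectious
        if recipient not in infected_at and rng.random() < p:
            infected_at[recipient] = when
    return set(infected_at)


# A few of the log lines shown on slide 28.
LOG_LINES = [
    "02/19/2003 15:45:33 I-1 I-2",
    "02/19/2003 15:45:33 I-1 I-3",
    "02/19/2003 15:45:40 E-1 I-4",
    "02/19/2003 15:46:10 I-12 I-13",
    "02/19/2003 15:46:10 I-12 I-14",
    "02/19/2003 15:46:10 I-12 I-15",
]
log = [parse_line(line) for line in LOG_LINES]
reached = simulate(log, "I-1", p=1.0, rng=random.Random(0))
```

Running this many times with random seeds and a range of `p` values, and recording how many people each run reaches, would give the kind of epidemic-size curve the talk refers to.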

  30. Introduce a decay in the transmission probability based on the hierarchical distance. [Diagram: organizational hierarchy with individuals A and B, branches labeled distance 1 and distance 2, and $h_{AB} = 5$.]

  31. [Figure: 7119 potential recipients; axis label $p_0$.]

  32. Conclusions on info flow in social groups: information spread typically does not reach epidemic proportions; information is passed on to individuals with matching properties; the likelihood that properties match decreases with distance from the source; the model gives a finite threshold; results are consistent with observed URL & attachment frequencies in a sample; simulations following real email patterns are also consistent.

  33. How to search in a small world. Milgram’s experiment: given a target individual and a particular property, pass the message to a person you correspond with who is “closest” to the target. [Map labels: NE, MA.]

  34. Small world experiment at Columbia. Dodds, Muhamad, Watts, Science 301 (2003): email experiment conducted in 2002; 18 targets in 13 different countries; 24,163 message chains; 384 reached their targets; average path length 4.0.

  35. Why study small world phenomena? Curiosity: Why is the world small? How are people able to route messages? Social networking as a business: Friendster, Orkut, MySpace, LinkedIn, Spoke, VisiblePath.

  36. Six degrees of separation: to be expected. Pool and Kochen (1978): the average person has 500-1500 acquaintances. Ignoring clustering and other redundancy, that gives ~$10^3$ first neighbors, $10^6$ second neighbors, $10^9$ third neighbors. But networks are clustered: my friends’ friends tend to be my friends. Watts & Strogatz (1998): a few random links in an otherwise clustered graph give an average shortest path close to that of a random graph.

  37. But how are people able to find short paths? How to choose among hundreds of acquaintances? Strategy: simple greedy algorithm, in which each participant chooses the correspondent who is closest to the target with respect to the given property. Models: geography, Kleinberg (2000); hierarchical groups, Watts, Dodds, Newman (2001), Kleinberg (2001); high degree nodes, Adamic, Puniyani, Lukose, Huberman (2001), Newman (2003).

  38. Spatial search, Kleinberg (2000). “The geographic movement of the [message] from Nebraska to Massachusetts is striking. There is a progressive closing in on the target area as each new person is added to the chain.” S. Milgram, ‘The small world problem’, Psychology Today 1, 61 (1967). Nodes are placed on a lattice and connect to nearest neighbors; additional links are placed with $f(d) \sim d(u,v)^{-r}$; if $r = 2$, can search in polylog ($< (\log N)^2$) time.
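
A minimal sketch of the lattice model on slide 38 and of greedy routing over it. Several details are assumptions, since the slide does not give them: a square n-by-n grid without wraparound, Manhattan distance, and exactly one long-range link per node. `build_lattice` and `greedy_route` are illustrative names.

```python
import random


def manhattan(u, v):
    """Lattice distance d(u, v) between two grid points."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])


def build_lattice(n, r, rng):
    """n x n grid: nearest-neighbor links plus one long-range link per node,
    chosen with probability proportional to d(u, v) ** (-r)."""
    nodes = [(x, y) for x in range(n) for y in range(n)]
    links = {}
    for u in nodes:
        x, y = u
        neighbors = [(x + dx, y + dy)
                     for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                     if 0 <= x + dx < n and 0 <= y + dy < n]
        others = [v for v in nodes if v != u]
        weights = [manhattan(u, v) ** (-r) for v in others]
        neighbors.append(rng.choices(others, weights=weights, k=1)[0])
        links[u] = neighbors
    return links


def greedy_route(links, source, target):
    """Forward to whichever contact is closest to the target; count steps."""
    current, steps = source, 0
    while current != target:
        current = min(links[current], key=lambda v: manhattan(v, target))
        steps += 1
    return steps


links = build_lattice(8, 2, random.Random(0))
steps = greedy_route(links, (0, 0), (7, 7))
```

Greedy routing always terminates here because some lattice neighbor is strictly closer to the target at every step; the long-range links can only shorten the route.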

  39. Kleinberg: searching hierarchical structures. ‘Small-World Phenomena and the Dynamics of Information’, NIPS 14, 2001. Hierarchical network models: h is the distance between two individuals in a hierarchy with branching b; $f(h) \sim b^{-\alpha h}$; if $\alpha = 1$, can search in O(log n) steps. Group structure models: q = size of the smallest group that two individuals belong to; $f(q) \sim q^{-\alpha}$; if $\alpha = 1$, can search in O(log n) steps.
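
A short illustration of the hierarchical model on slide 39. It assumes individuals are the leaves of a complete tree with branching factor `b`, numbered left to right, and that the distance h is the height of their lowest common ancestor; the slide only calls h "the distance" in the hierarchy, so that reading is an assumption. The exponent is called `alpha` here, and the function names are illustrative.

```python
def tree_distance(u, v, b):
    """Height of the lowest common ancestor of leaves u and v in a b-ary tree."""
    h = 0
    while u != v:
        u //= b  # move both leaves up one level
        v //= b
        h += 1
    return h


def link_weight(u, v, b, alpha):
    """Unnormalized link probability f(h) ~ b ** (-alpha * h)."""
    return b ** (-alpha * tree_distance(u, v, b))


# Leaf 0 in a binary tree: siblings are far more likely contacts than
# leaves in the other half of the tree.
weights = [link_weight(0, v, 2, 1.0) for v in (1, 3, 7)]
```

With `alpha = 1` each extra level of separation divides the weight by `b`, which exactly offsets the `b`-fold growth in the number of leaves at that distance.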

  40. Identity and search in social networks, Watts, Dodds, Newman (2001): individuals belong to hierarchically nested groups; multiple independent hierarchies coexist; $p_{ij} \sim \exp(-\alpha x)$.

  41. Identity and search in social networks, Watts, Dodds, Newman (2001): there is an attrition rate r; the network is ‘searchable’ if a fraction q of messages reach the target. [Figure: curves for N=102400, N=204800, N=409600.]

  42. High degree search. Adamic et al., Phys. Rev. E 64, 46135 (2001). “Who could introduce me to Richard Gere?” [Diagram: Mary, Bob, Jane.]
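
One way to sketch a high degree search, assuming the strategy is to hand the message to the best-connected contact who has not yet seen it, and to hand it back when a dead end is reached. The slides do not spell out these rules, so the backtracking and the step counting are assumptions, and `high_degree_search` is an illustrative name.

```python
def high_degree_search(graph, source, target):
    """graph: dict mapping each node to the set of its neighbors.

    Returns the number of hand-offs until the target receives the message,
    or None if the target cannot be reached.
    """
    if source == target:
        return 0
    visited = {source}
    stack = [source]  # current chain of holders, used for backtracking
    steps = 0
    while stack:
        current = stack[-1]
        if target in graph[current]:
            return steps + 1  # the holder knows the target directly
        unvisited = [v for v in graph[current] if v not in visited]
        if not unvisited:
            stack.pop()  # dead end: hand the message back
            steps += 1
            continue
        best = max(unvisited, key=lambda v: len(graph[v]))
        visited.add(best)
        stack.append(best)
        steps += 1
    return None


# Toy network, invented for this example: "hub" knows almost everyone.
graph = {
    "a": {"b", "hub"}, "b": {"a"},
    "hub": {"a", "c", "d", "t"},
    "c": {"hub"}, "d": {"hub"}, "t": {"hub"},
}
hops = high_degree_search(graph, "a", "t")
```

The approach pays off when a few nodes have very high degree, as in a power-law graph, and much less when degrees are roughly uniform.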

  43. [Figure: power-law graph, annotated with the number of nodes found: 1, 2, 6, 54, 63, 67, 94.]

  44. [Figure: Poisson graph, annotated with the number of nodes found: 1, 3, 7, 11, 15, 19, 93.]

  45. Scaling of search time with size of graph. Sharp cutoff at $k \sim N^{1/\alpha}$, 2nd degree neighbors. [Figure: log-log plot of cover time for half the nodes vs. size of graph ($10^1$ to $10^5$); random walk, fit $\alpha = 0.37$; degree sequence, fit $\alpha = 0.24$.]

  46. Testing the models on social networks (w/ Eytan Adar). Use a well defined network: HP Labs email correspondence over 3.5 months. Edges are between individuals who sent at least 6 email messages each way. Node properties specified: degree; geographical location; position in organizational hierarchy. Can greedy strategies work?

  47. Strategy 1: High degree search. [Figure: degree distribution (outdegree) of all senders of email passing through the HP email server.]

  48. Filtered network (6 messages sent each way): degree distribution no longer power-law, but Poisson; 450 users; median degree = 10, mean degree = 13; average shortest path = 3. High degree search performance (poor): median # steps = 16, mean = 40.

  49. Strategy 2: Geography

  50. Communication across corporate geography: 87% of the 4000 links are between individuals on the same floor. [Floor plan labels: 1U, 1L, 2U, 2L, 3U, 3L, 4U.]
