
Introduction to Network Analysis

Introduction to Network Analysis. Marko Grobelnik, Dunja Mladenic JSI. Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI. Outline. What are networks? …few examples Network properties Small worlds



Presentation Transcript


  1. Introduction to Network Analysis Marko Grobelnik, Dunja Mladenic JSI Parts of the presentation taken from the tutorial “Structure and function of real-world graphs and networks” by Jure Leskovec, CMU/JSI

  2. Outline • What are networks? • …few examples • Network properties • Small worlds • Power law • Long tail • Network Resilience • Structure of networks • Applications • Mining e-mail server logs • Mining MSN Messenger data

  3. Networks & Computer Science [Diagram labels: Computer systems; Machine learning / Data mining; (complex) networks; Statistics; Theory and algorithms]

  4. Networks & Science [Diagram labels: Computer Science (Computer systems; Machine learning / Data mining; Statistics; Theory and algorithms); (complex) networks; Physics; Biology; Social Sciences; Industry & Applications]

  5. Networks (graphs) Vertex / Node

  6. Networks (graphs) Vertex / Node Edge/ Link

  7. Networks (graphs) Vertex / Node Edge/ Link Direction

  8. Networks (graphs) • Vertex / Node • Edge / Link • Direction • Probabilities (edge labels in the figure: 0.3, 0.1, 0.6)
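The elements introduced on slides 5 to 8 (nodes, directed edges, probabilities on edges) can be sketched as a small data structure. This is only an illustration: the node names are made up, and treating the three values from the figure as the outgoing probabilities of a single node is an assumption.

```python
# Directed, weighted graph as a dict of dicts: graph[u][v] = weight of edge u -> v.
# Node names are hypothetical; 0.3, 0.1 and 0.6 echo the labels on the slide.
graph = {
    "A": {"B": 0.3, "C": 0.1, "D": 0.6},
    "B": {"C": 1.0},
    "C": {},
    "D": {"A": 1.0},
}

def out_degree(g, node):
    """Number of edges leaving `node`."""
    return len(g[node])

def in_degree(g, node):
    """Number of edges pointing at `node`."""
    return sum(1 for targets in g.values() if node in targets)

def is_stochastic(g, node, tol=1e-9):
    """True if the outgoing weights of `node` sum to 1, i.e. form a probability distribution."""
    return abs(sum(g[node].values()) - 1.0) < tol
```

A dict of dicts keeps edge lookup cheap and stores only the edges that exist, which matters for the sparse networks discussed later.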

  9. Dynamic Networks (graphs) • …in dynamic networks all the elements of the graph are changing: vertices / nodes, edges / links, directions, and probabilities (edge labels in the figure: 0.3, 0.1, 0.6) • …dealing with dynamic networks is an active research topic

  10. Example of Dynamic Graph (1/3) [Figure labels: Query; Active topic during a limited time period]

  11. Example of Dynamic Graph (2/3) On 1996-08-30 Clinton and Chicago are connected

  12. Example of Dynamic Graph (3/3) On 1996-10-02 Clinton and Chicago are NOT connected

  13. Networks of the real-world (1) • Information networks: • World Wide Web: hyperlinks • Citation networks • Blog networks • Social networks: people + interactions • Organizational networks • Communication networks • Collaboration networks • Sexual networks • Technological networks: • Power grid • Airline, road, river networks • Telephone networks • Internet • Autonomous systems [Figures: Florence families; Karate club network; Collaboration network; Friendship network]

  14. Networks of the real-world (2) • Biological networks • metabolic networks • food web • neural networks • gene regulatory networks • Language networks • Semantic networks • Software networks • … [Figures: Semantic network; Yeast protein interactions; Language network; XFree86 network]

  15. Types of networks • Directed/undirected • Multi graphs (multiple edges between nodes) • Hyper graphs (edges connecting multiple nodes) • Bipartite graphs (e.g., papers to authors) • Weighted networks • Different type nodes and edges • Evolving networks: • Nodes and edges only added • Nodes, edges added and removed

  16. Traditional approach • Sociologists were the first to study networks: • Study of patterns of connections between people to understand the functioning of society • People are nodes, interactions are edges • Questionnaires are used to collect link data (hard to obtain, inaccurate, subjective) • Typical questions: centrality and connectivity • Limited to small graphs (~10 nodes) and properties of individual nodes and edges

  17. New approach (1) • Large networks (e.g., web, internet, on-line social networks) with millions of nodes • Many traditional questions are no longer useful: • Traditional: What happens if a node U is removed? • Now: What percentage of nodes needs to be removed to affect network connectivity? • Focus moves from a single node to the study of statistical properties of the network as a whole • We cannot draw (plot) the network and examine it

  18. New approach (2) • What does the network “look like” even if I can’t look at it? • Need for statistical methods and tools to quantify large networks • 3 parts/goals: • Statistical properties of large networks • Models that help understand these properties • Predict behavior of networked systems based on measured structural properties and local rules governing individual nodes

  19. Statistical properties of networks • Features common to networks of different types: • Properties of static networks: • Small-world effect • Transitivity or clustering • Degree distributions (scale free networks) • Network resilience • Community structure • Subgraphs or motifs • Temporal properties: • Densification • Shrinking diameter

  20. Small-world effect • Six degrees of separation (Milgram, 1960s) • Random people in Nebraska were asked to send letters to stockbrokers in Boston • Letters could only be passed to first-name acquaintances • Only 25% of the letters reached the goal • But they reached it in about 6 steps • Measuring path lengths: • Diameter (longest shortest path): max d_ij • Effective diameter: distance at which 90% of all connected pairs of nodes can be reached • Mean geodesic (shortest) distance l
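The three path-length measures above can be computed with breadth-first search on a small unweighted graph. The sketch below assumes an undirected graph stored as adjacency lists and uses a hypothetical five-node path as input; all-pairs BFS like this is only practical for small networks.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_length_stats(adj):
    """Return (diameter, effective diameter, mean geodesic distance) over connected pairs."""
    lengths = []
    for s in adj:
        lengths.extend(d for node, d in bfs_distances(adj, s).items() if node != s)
    lengths.sort()
    diameter = lengths[-1]
    # Smallest distance within which at least 90% of connected pairs are reached.
    idx = (9 * len(lengths) + 9) // 10 - 1
    effective = lengths[idx]
    mean = sum(lengths) / len(lengths)
    return diameter, effective, mean

# Hypothetical toy graph: a path 0 - 1 - 2 - 3 - 4.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
diameter, effective, mean = path_length_stats(path)
```

On the toy path the diameter is 4 while the effective diameter is 3, which shows why the 90% variant is less sensitive to a few extreme pairs.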

  21. Small World Networks on Web • An empirical observation for the Web graph is that its diameter is small relative to the size of the network • …this property is called “Small World” • …formally, small-world networks have a diameter exponentially smaller than their size • By simulation it was shown that for a Web of 1B pages the diameter is approx. 19 steps • …empirical studies confirmed the findings

  22. Small World on FP5-IST (collaboration network) • The network represents collaboration between institutions on FP5-IST projects funded by European Union • …there are 7886 organizations collaborating on 2786 projects • …in the network, each node is an organization, two organizations are connected if they collaborate on at least one project • Small world properties of the collaboration network: • Main connected part of the network contains 94% of the nodes • Max distance between any two organizations is 7 steps … meaning that any organization can be reached in up to 7 steps from any other organization • Average distance between any two organizations is 3.15 steps (with standard deviation 0.38) • 38% (2770) of organizations have avg. distance 3 or less
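The construction described above (organizations as nodes, an edge when two organizations share at least one project) can be sketched as follows. The project and organization names are hypothetical and are not taken from the FP5-IST data; the second function reports the share of nodes in the largest connected part.

```python
from itertools import combinations

def collaboration_graph(projects):
    """projects: dict mapping project -> iterable of organizations. Returns adjacency sets."""
    adj = {}
    for members in projects.values():
        members = set(members)
        for org in members:
            adj.setdefault(org, set())
        for a, b in combinations(sorted(members), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj

def largest_component_fraction(adj):
    """Fraction of nodes that belong to the largest connected component."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / len(adj)

# Hypothetical input, for illustration only.
projects = {"P1": ["OrgA", "OrgB", "OrgC"], "P2": ["OrgC", "OrgD"], "P3": ["OrgE"]}
adj = collaboration_graph(projects)
fraction = largest_component_fraction(adj)
```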

  23. Connectedness of the most connected institution • 1856 collaborations • avg. distance is 1.95 • max. distance is 4

  24. Connectedness of semi connected institution • 179 collaborations • avg. distance is 2.42 • max. distance is 4

  25. Connectedness of min. connected institution • 8 collaborations • max. distance is 7

  26. Small World effect on MSN Messenger Network • Distribution of shortest path lengths • Microsoft Messenger network • 180 million people • 1.3 billion edges • Edge if two people exchanged at least one message in a one-month period • Pick a random node, count how many nodes are at distance 1, 2, 3... hops [Figure: number of nodes vs. distance (hops)]
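The measurement described on this slide (pick a random node, count how many nodes sit at 1, 2, 3... hops) is a breadth-first search followed by a tally. A minimal sketch on a hypothetical six-node ring, not on the Messenger data:

```python
import random
from collections import Counter, deque

def hop_histogram(adj, source):
    """Count how many nodes lie at each hop distance (1, 2, 3, ...) from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return Counter(d for d in dist.values() if d > 0)

# Hypothetical toy network: a ring of six nodes.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
start = random.Random(0).choice(sorted(ring))
histogram = hop_histogram(ring, start)
```

On a network with hundreds of millions of nodes one would repeat this from a sample of start nodes rather than from every node.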

  27. What is Power Law? • A power law describes how a quantity (e.g., node degree) is distributed over the objects in the network • …it is very characteristic of networks generated by some kind of social process • …it describes the scale invariance found in many natural phenomena (including physics, biology, sociology, economy and linguistics)

  28. Power-Law on the Web • In the context of the Web, the power law appears in many cases: • Web page sizes • Web page connectivity • Web connected components’ size • Web page access statistics • Web browsing behavior • Formally, the power law describing web page degrees is: [formula not preserved in this transcript] (This property has been preserved as the Web has grown)
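The formula from this slide is not available in the transcript. For orientation only, the general form of a power-law degree distribution is given below; the exponent $\gamma$ is left symbolic because the value reported on the slide is not known here.

```latex
% General power-law form for the probability that a page has degree k.
p(k) \propto k^{-\gamma}, \qquad \gamma > 1

% Scale invariance: rescaling k by a constant a only rescales p by a constant factor.
p(a k) = a^{-\gamma}\, p(k)

% On log-log axes the distribution is a straight line with slope -\gamma.
\log p(k) = -\gamma \log k + c
```

The straight line on log-log axes is the usual visual check for power-law behaviour in a degree plot.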

  29. Degree distribution: number of people a person talks to on Microsoft Messenger [Figure: count vs. node degree, with the highest degree marked by an X]
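A degree distribution like the one plotted on this slide is a tally of how many nodes have each degree. A minimal sketch, assuming an undirected edge list without duplicate edges; the star graph is a hypothetical example.

```python
from collections import Counter

def degree_distribution(edges):
    """Map each degree value to the number of nodes that have that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())

# Hypothetical toy graph: one hub connected to four leaves.
star_edges = [("hub", f"leaf{i}") for i in range(4)]
distribution = degree_distribution(star_edges)
highest_degree = max(distribution)
```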

  30. Detour: how long is the long tail? This is not directly related to graphs, but it nicely explains the “long tail” effect. It shows that there is a big market for niche products.

  31. Network resilience • We observe how the connectivity (length of the paths) of the network changes as vertices are removed • It is important for epidemiology • Removal of vertices corresponds to vaccination • Real-world networks are resilient to random attacks • One has to remove all web pages of degree > 5 to disconnect the web • …but this is a very small percentage of web pages • A random network has better resilience to targeted attacks
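The difference between random and targeted removal can be illustrated by removing nodes and measuring the largest remaining connected component. This is a sketch under simplifying assumptions: the graph is undirected, "targeted" means highest degree first with degrees taken from the original graph, and the star graph is a hypothetical extreme case.

```python
import random

def largest_component_size(adj, removed=frozenset()):
    """Size of the largest connected component once the `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best

def attack(adj, k, targeted, rng=None):
    """Remove k nodes (highest degree first if targeted, otherwise uniformly at random)."""
    if targeted:
        victims = sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:k]
    else:
        victims = (rng or random).sample(sorted(adj), k)
    return largest_component_size(adj, set(victims))

# Hypothetical toy graph: a hub with five leaves.
star = {"hub": [f"leaf{i}" for i in range(5)]}
star.update({f"leaf{i}": ["hub"] for i in range(5)})
after_targeted = attack(star, 1, targeted=True)
after_leaf_loss = largest_component_size(star, {"leaf0"})
```

Removing the hub shatters the star into isolated nodes, while losing a leaf barely changes it; heavy-tailed networks behave similarly on a larger scale.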

  32. Network motifs (1) • What are the building blocks (motifs) of networks? • Do motifs have specific roles in networks? • Network motifs detection process: • Count how many times each subgraph appears • Compute the statistical significance of each subgraph: the probability that it appears in a random network as often as in the real network [Figure: 3-node motifs]
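The two-step detection process above can be sketched for one 3-node motif, the feed-forward loop (a→b, b→c, a→c). The null model here is an assumption made for brevity: random directed graphs with the same number of nodes and edges. Motif studies commonly use degree-preserving randomization instead, so treat the resulting value as illustrative.

```python
import random

def count_ffl(edges):
    """Count feed-forward loops: distinct a, b, c with edges a->b, b->c and a->c."""
    succ = {}
    for u, v in set(edges):
        succ.setdefault(u, set()).add(v)
    count = 0
    for a, outs in succ.items():
        for b in outs:
            if b == a:
                continue
            for c in succ.get(b, ()):
                if c != a and c != b and c in outs:
                    count += 1
    return count

def motif_p_value(edges, trials=200, seed=0):
    """Fraction of random graphs containing at least as many loops as the real graph."""
    rng = random.Random(seed)
    nodes = sorted({n for e in edges for n in e})
    pairs = [(u, v) for u in nodes for v in nodes if u != v]
    m = len(set(edges))
    observed = count_ffl(edges)
    hits = sum(count_ffl(rng.sample(pairs, m)) >= observed for _ in range(trials))
    return observed, hits / trials

# Hypothetical toy graph containing exactly one feed-forward loop.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
observed, p_value = motif_p_value(edges)
```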

  33. Network motifs (2) • Biological networks • Feed-forward loop • Bi-fan motif • Web graph: • Feedback with two mutual dyads • Mutual dyad • Fully connected triad

  34. Shrinking diameters • Intuition says that distances between the nodes slowly grow as the network grows (like log n) • But as the network grows the distances between nodes slowly decrease [Figure panels: Internet; Citations]

  35. Structure of the Web – “Bow Tie” model • In November 1999 a large-scale study using AltaVista crawls of over 200M nodes and 1.5B links reported a “bow tie” structure of web links • …we suspect that, because of the scale-free nature of the Web, this structure is still preserved

  36. TENDRILS – components reachable only via a directed path from IN, or leading only to OUT, but not connected from or to the core • SCC – strongly connected component (the core), where pages can reach each other via directed paths • OUT – pages that can be reached from the core via a directed path but cannot reach the core in a similar way • IN – pages that can reach the core via a directed path but cannot be reached from the core
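The component definitions above translate into two reachability searches from any page known to lie in the core: one along the links and one against them. A minimal sketch, assuming such a core page is given; the edge list is hypothetical, and everything outside SCC, IN and OUT (tendrils and fully disconnected pages) is lumped together as OTHER.

```python
def reachable(adj, start):
    """Set of nodes reachable from `start` by following edges in `adj`."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bow_tie(edges, core_node):
    """Split nodes into SCC, IN, OUT and OTHER relative to the core containing `core_node`."""
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        forward.setdefault(v, set())
        backward.setdefault(v, set()).add(u)
        backward.setdefault(u, set())
    downstream = reachable(forward, core_node)   # pages the core can reach
    upstream = reachable(backward, core_node)    # pages that can reach the core
    scc = downstream & upstream
    return {
        "SCC": scc,
        "IN": upstream - scc,
        "OUT": downstream - scc,
        "OTHER": set(forward) - downstream - upstream,
    }

# Hypothetical toy web: i -> {a <-> b} -> o, a tendril t hanging off i, and a separate pair x -> y.
links = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(links, "a")
```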

  37. Mining email server logs

  38. Ontology generation from social networks data • We address the problem of how to construct a taxonomy from social network data • …we adapt the approach used when dealing with text • As an example we use the e-mail graph of a mid-size research institution • ...communication records of 770 people at JSI • The experiments and evaluation show our approach to be useful and applicable in real-life situations • …the approach could be easily reused in case studies (and elsewhere)

  39. Architecture • The main contribution of the deliverable is the architecture & software, consisting of 5 major steps: • We start with log files from the institutional e-mail server, where the data include information about e-mail transactions with three fields: time, sender and the list of receivers. • After cleaning we get the data in the form of e-mail transactions, which include the e-mail addresses of sender and receiver. • From the set of e-mail transactions we construct a graph where vertices are e-mail addresses, connected if there is a transaction between them. • The e-mail graph is transformed into a sparse matrix, which allows us to perform data manipulation and analysis operations. • The sparse matrix representation of the graph is analyzed with ontology learning tools, producing an ontological structure corresponding to the organizational structure of the institution the e-mails came from.
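The middle steps of this pipeline (transactions → graph → sparse matrix) can be sketched in a few lines. This is not the authors' software: the addresses are hypothetical, the matrix is kept symmetric, and storing a message count per pair rather than a 0/1 flag is an assumption.

```python
def build_sparse_matrix(transactions):
    """transactions: iterable of (sender, [receivers]).

    Returns (index, matrix): `index` maps each address to a row/column number and
    `matrix` is a dictionary-of-keys sparse matrix {(row, col): message count}.
    """
    index = {}
    matrix = {}
    for sender, receivers in transactions:
        for receiver in receivers:
            if receiver == sender:
                continue  # ignore mail sent to oneself
            i = index.setdefault(sender, len(index))
            j = index.setdefault(receiver, len(index))
            # Symmetric update: an edge means "there was a transaction between them".
            matrix[(i, j)] = matrix.get((i, j), 0) + 1
            matrix[(j, i)] = matrix.get((j, i), 0) + 1
    return index, matrix

# Hypothetical cleaned transactions.
transactions = [
    ("a@example.org", ["b@example.org", "c@example.org"]),
    ("b@example.org", ["a@example.org"]),
]
index, matrix = build_sparse_matrix(transactions)
```

Only pairs that actually exchanged mail are stored, so memory grows with the number of edges rather than with the square of the number of addresses.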

  40. Data used for Experimentation • The data is the collection of log files with e-mail transactions from local e-mail spam filter software Amavis (http://www.amavis.org/): • Each line of the log files denotes one event at the spam filter software • We were interested in the events on successful e-mail transactions • ...having information on time, sender, and list of receivers • An example of successful e-mail transaction is the following line: • 2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, [217.32.164.151] [193.113.30.29] <john.nj.davies@bt.com> -> <marko.grobelnik@ijs.si>, Message-ID: <21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41-ukdy.domain1.systemhost.net>, Hits: -1.668, 6389 ms
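Extracting the three fields of interest (time, sender, list of receivers) from a line like the one above can be sketched with a regular expression. The pattern is derived from this single sample line only; how several receivers are written (assumed here to be comma-separated `<...>` entries) and how other Amavis event types look are assumptions.

```python
import re
from datetime import datetime

# Pattern based on the one sample line shown on the slide.
LINE_RE = re.compile(
    r"^(?P<time>\d{4} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) "
    r".*?Passed CLEAN, .*?"
    r"<(?P<sender>[^<>]*)> -> (?P<receivers>(?:<[^<>]*>,?)+)"
)

def parse_line(line):
    """Return (time, sender, receivers) for a successful transaction, else None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%Y %b %d %H:%M:%S")
    receivers = re.findall(r"<([^<>]*)>", match.group("receivers"))
    return when, match.group("sender"), receivers

SAMPLE = (
    "2005 Mar 28 13:59:05 patsy amavis[33972]: (33972-01-3) Passed CLEAN, "
    "[217.32.164.151] [193.113.30.29] <john.nj.davies@bt.com> -> "
    "<marko.grobelnik@ijs.si>, Message-ID: "
    "<21DA6754A9238B48B92F39637EF307FD0D4781C8@i2km41-ukdy.domain1.systemhost.net>, "
    "Hits: -1.668, 6389 ms"
)
parsed = parse_line(SAMPLE)
```

Lines that do not describe a clean, passed message return `None`, which mirrors the filtering step that keeps only successful transactions.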

  41. Some statistics about the data • The log files include e-mail data from Sep 5th 2003 to Mar 28th 2005: • …this sums up to 12.8 GB of data • After filtering out the successful e-mail transactions, 564 MB remain • …which contain approx. 2.7 million successful e-mail transactions used for further processing • The whole dataset contains references to approx. 45000 e-mail addresses • …after the data cleaning phase the number is reduced to approx. 17000 e-mail addresses • …out of which 770 e-mail addresses are internal to the home institution (with the “ijs.si” domain name)

  42. Organizational structure of JSI produced from cleaned e-mail transactions with OntoGen in <5 minutes

  43. Organizational structure of JSI visualized from e-mail transactions with Document-Atlas

  44. Evaluation Part of clustering results for “Jozef Stefan Institute” e-mail data into 10 clusters (C-0, C-1, …C-9) showing distribution of the clustered e-mails over the Institute departments.

  45. Analysis of MSN Messenger Communication Network By Jure Leskovec

  46. Data that we have: Communication • For every conversation (session) we have a list of users who participated in the conversation • There can be multiple people per conversation • For each conversation and each user: • User Id • Time Joined • Time Left • Number of Messages Sent • Number of Messages Received
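The per-conversation records listed above can be modelled as a small record type and turned into network edges. The field names follow the slide; the edge rule (two participants of the same session are linked if at least one of them sent a message) is an assumption made for illustration, not the rule used in the original analysis.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Participation:
    """One user's part in one conversation (session), with the fields listed on the slide."""
    session_id: int
    user_id: int
    time_joined: float
    time_left: float
    messages_sent: int
    messages_received: int

def communication_edges(records):
    """Return the set of undirected edges (smaller user id first) implied by the sessions."""
    sent = {}  # session_id -> {user_id: total messages sent in that session}
    for r in records:
        per_user = sent.setdefault(r.session_id, {})
        per_user[r.user_id] = per_user.get(r.user_id, 0) + r.messages_sent
    edges = set()
    for per_user in sent.values():
        for a, b in combinations(sorted(per_user), 2):
            if per_user[a] > 0 or per_user[b] > 0:
                edges.add((a, b))
    return edges

# Hypothetical records: one three-person session and one silent two-person session.
records = [
    Participation(1, 1, 0.0, 60.0, 2, 0),
    Participation(1, 2, 0.0, 60.0, 0, 2),
    Participation(1, 3, 5.0, 30.0, 0, 2),
    Participation(2, 4, 0.0, 10.0, 0, 0),
    Participation(2, 5, 0.0, 10.0, 0, 0),
]
edges = communication_edges(records)
```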

  47. Data that we have: Demographics • For every user (self reported): • Age • Gender • Location (Country, ZIP) • Language • IP address (we can do reverse GeoIP lookup)

  48. Facts about the data • 150 GB compressed logs per day • Just copying over the network takes 8 to 10 hours • Parsing and processing takes another 4 to 6 hours • After parsing, collapsing, saving as binary and compressing ~ 40GB per day • Collected data for all of June 2006: • 1.3TB of data

  49. User age distribution (self reported) [Figure: count vs. age]
