
Network Analysis Class

This network analysis class provides an introduction to networks, graphs, nodes, and edges. Learn about degree centrality, edge weight, components, and betweenness centrality in network analysis.

rfischer

Presentation Transcript


  1. Network Analysis Class. Office of Portfolio Analysis, Division of Program Coordination, Planning, and Strategic Initiatives, National Institutes of Health. Today’s instructor: Patricia Forcinito (patricia.forcinito@nih.gov)

  2. Class Structure

  3. Introduction to Networks

  4. Graphs, Nodes & Edges • The structure of a network is referred to as a graph. • Graphs consist of a set of objects, called nodes (circles), where certain pairs of these objects may be connected by links called edges (lines). [Figure: example graph with labeled nodes and edges, shown with its node list and edge list]

  5. Nodes: Degree or Degree Centrality. The degree of a node is the number of edges it has. • For example: • The degree of A is 1 • The degree of B is 3 • The degree of D is 0 • Isolated nodes: nodes that have degree 0 (e.g., node D). [Figure: example graph with nodes A, B, C, D, and E]
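
A minimal sketch of counting degrees from an edge list, in Python. The edge list here is hypothetical: it was chosen only so that nodes A, B, and D end up with the degrees stated on the slide, and it is not taken from the slide's figure.

```python
# Hypothetical node and edge lists (assumption: chosen to reproduce the
# degrees the slide states for A, B, and D).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("B", "E")]

# Start every node at degree 0 so that isolated nodes are not lost.
degree = {node: 0 for node in nodes}
for source, target in edges:
    degree[source] += 1
    degree[target] += 1

# Isolated nodes are the nodes whose degree is still 0.
isolates = [node for node, count in degree.items() if count == 0]

print(degree)
print(isolates)
```

Starting from the node list rather than the edge list matters: a node that appears in no edge would otherwise never be counted as an isolate.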

  6. Edges: Weight. The weight of an edge is the number of links between two nodes. For example, for co-authors, the weight of an edge represents the number of times the authors published together; the more times they publish together, the stronger their relationship. [Figure: nodes A, B, C, and D with an edge list]

  7. Edges: Weight (continued). [Figure: nodes A, B, C, and D with an edge list; label: Pub#1]

  8. Edges: Weight (continued). [Figure: nodes A, B, C, and D with an edge list; label: Pub#2, #3]

  9. Edges: Weight (continued). [Figure: nodes A, B, C, and D with an edge list; label: Pub#4, #5, #6]

  10. Edges: Weight (continued). Edge widths are adjusted to their weight. [Figure: nodes A, B, C, and D with an edge list]
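
As a rough illustration of how co-author edge weights arise, the sketch below counts how often each pair of authors appears on the same publication. The publication-to-author assignments are invented for illustration; the slides do not show which authors wrote which publication.

```python
from collections import Counter
from itertools import combinations

# Hypothetical author lists (assumption: not the data behind the slides).
publications = {
    "Pub#1": ["A", "B"],
    "Pub#2": ["B", "C"],
    "Pub#3": ["B", "C"],
    "Pub#4": ["C", "D"],
    "Pub#5": ["C", "D"],
    "Pub#6": ["C", "D"],
}

weights = Counter()
for authors in publications.values():
    # Sort so that (A, B) and (B, A) count as the same undirected edge.
    for pair in combinations(sorted(set(authors)), 2):
        weights[pair] += 1

for (source, target), weight in sorted(weights.items()):
    print(source, target, weight)
```

Each key of `weights` is one row of an edge list, and its count is the edge weight.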

  11. Component • A component is a group of connected nodes. • Each component is a subset of nodes such that: • every node has a path to every other node in the subset • the subset is not part of a larger connected set. [Figure: network with five components, labeled Component 1 to Component 5]
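
One common way to find components is a breadth-first search from each unvisited node. The sketch below assumes an undirected network given as a node list and an edge list; the example data is hypothetical.

```python
from collections import deque

def connected_components(nodes, edges):
    """Return the components of an undirected graph as sorted lists of nodes."""
    adjacency = {node: set() for node in nodes}
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

# Hypothetical example: two connected groups and one isolated node.
example_nodes = ["A", "B", "C", "D", "E", "F"]
example_edges = [("A", "B"), ("B", "C"), ("D", "E")]
components = connected_components(example_nodes, example_edges)
print(components)
```

An isolated node forms a component of size one, which is why `F` appears on its own.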

  12. Betweenness Centrality (BC) • BC refers to the number of shortest paths, between all pairs of nodes, that pass through a given node.
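
A minimal sketch of betweenness centrality for an unweighted, undirected network, using Brandes' algorithm. Note one assumption that goes beyond the slide: when two nodes are joined by several equally short paths, this standard formulation credits a node with the fraction of those paths that pass through it, rather than a plain count. The scores are not normalized, and the example network is hypothetical.

```python
from collections import deque

def betweenness_centrality(nodes, edges):
    """Unnormalized betweenness for an unweighted, undirected graph."""
    adjacency = {node: [] for node in nodes}
    for source, target in edges:
        adjacency[source].append(target)
        adjacency[target].append(source)

    score = {node: 0.0 for node in nodes}
    for source in nodes:
        # Breadth-first search: count shortest paths from `source`.
        distance = {source: 0}
        path_count = {node: 0 for node in nodes}
        path_count[source] = 1
        predecessors = {node: [] for node in nodes}
        visit_order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            visit_order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)

        # Walk back from the farthest nodes, accumulating path shares.
        dependency = {node: 0.0 for node in nodes}
        for w in reversed(visit_order):
            for v in predecessors[w]:
                share = path_count[v] / path_count[w]
                dependency[v] += share * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]

    # Each unordered pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}

# Hypothetical chain A - B - C - D.
bc = betweenness_centrality(["A", "B", "C", "D"],
                            [("A", "B"), ("B", "C"), ("C", "D")])
print(bc)
```

In the chain, B lies on the shortest paths A to C and A to D, and C lies on A to D and B to D, so both inner nodes score 2 while the end nodes score 0.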

  13. Degree & Betweenness Centrality. [Figure: co-author network sized and colored based on degree centrality, with nodes A, B, C, and D labeled; the node with the network’s highest degree is marked]

  14. Degree & Betweenness Centrality. [Figure, two panels with nodes A, B, C, and D labeled: co-author network sized and colored based on degree centrality, marking the node with the network’s highest degree; co-author network sized and colored based on betweenness centrality, marking the node with the network’s highest betweenness]

  15. Degree & Betweenness Centrality. [Figure: the original network, the network after removing node B, and the network after removing node A; labels: 3 isolates, 12 nodes, 12 nodes, 13 nodes, 8 nodes]

  16. Working example: example to follow during the class. What is the extent of collaboration of the NIH Tuberculosis portfolio between FY 2013 and 2014? Assumption: two authors who publish together have established some degree of collaboration. The portfolios to be studied using co-author networks are: • Tuberculosis RCDC category • Malaria RCDC category (control group)

  17. Data collection

  18. Data collection: extract grants data from iSearch Grants. Query: all NIH RPG awarded applications, FY 2013-2014. • Tuberculosis RCDC category. Link to iSearch query: https://itools.od.nih.gov/isearch/grants/#search:searchId=59ee0daec3ddac60b7a7adc2 • Malaria RCDC category. Link to iSearch query: https://itools.od.nih.gov/isearch/grants/#search:searchId=59ee0dd4c3ddac60b7a7adc4

  19. Getting to know my data (grants): Are my study groups comparable? Checkpoint 1 • It is very important to understand the data we are using to build the networks (in our example, we are starting with grants data): • Number of applications • Number of core grants • Distribution of grants across ICs • Distribution of grants across mechanisms • Number of linked pubs • Other factors, such as specific FOAs requesting collaboration or leadership groups established during the studied time frame. The extent of collaboration observed using co-author networks can often be partly explained by some of the variables listed above.

  20. Getting to know my data (grants): Are my study groups comparable? Do I need more than one control group? Checkpoint 1. The Tuberculosis and Malaria portfolios are mostly concentrated in one IC.

  21. Getting to know my data (grants): Are my study groups comparable? Do I need more than one control group? Checkpoint 1. [Figure: portfolio distribution of award mechanisms between groups]

  22. Data collection: extract pubs data from iSearch Pubs. Search query: publications from 2014 to 2016 (the grants time frame is from 2013 to 2014). Link to Tuberculosis iSearch query: https://itools.od.nih.gov/isearch/publications/#search:searchId=59ee0ec2c3ddac61900d2e7e Link to Malaria iSearch query: https://itools.od.nih.gov/isearch/publications/#search:searchId=59ee0e65c3ddac61900d2e7d

  23. Publications are linked to NIH grants through their core project number. There are 54 pubs linked to these applications, from 1999 to 2018. Why from 1999?

  24. Publications are linked to NIH grants through their core project number. [Figure: our initial query]

  25. Getting to know my data (grants). Advanced tip: linking grants to publications. Checkpoint 1 • Linkages between NIH grants and publications are made at the core grant level. • Some implications of this methodology are: • When linking grants from a specific time frame, publications from other years (if present) will also be in the results. Tip: limit your publications to the years you are interested in. • False positives: because publications are linked to the entire grant, all PIs involved in a grant will be linked to all the publications that mention that particular grant. • This is especially important to note in awards with subprojects. • PIs may be linked to publications where they are not authors.
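
The tip about limiting publications to the years of interest can be sketched as a simple filter. The records below are hypothetical placeholders (the IDs and the `year` field name are assumptions); only the 2014 to 2016 range comes from the class example.

```python
# Hypothetical publication records linked to a set of core grants.
linked_publications = [
    {"id": "pub-a", "year": 1999},
    {"id": "pub-b", "year": 2014},
    {"id": "pub-c", "year": 2016},
    {"id": "pub-d", "year": 2018},
]

def within_years(publications, first_year, last_year):
    """Keep only publications whose year falls inside the inclusive range."""
    return [pub for pub in publications
            if first_year <= pub["year"] <= last_year]

selected = within_years(linked_publications, 2014, 2016)
print([pub["id"] for pub in selected])
```

This only addresses the out-of-range years; it does not remove the false positives that come from linking at the core grant level.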

  26. Getting to know my data (pubs): Are my study groups comparable? Checkpoint 2

  27. Data cleaning: name disambiguation. All three steps involved in name disambiguation are needed to create a valid and meaningful co-author network.

  28. Data cleaning: name disambiguation, background. How much does the name disambiguation step affect co-author networks? • Before name disambiguation: Hilderbrand, S; Hilderbrand, Scott; Hilderbrand, Scott A; Weigl, B H; Weigl, Bernhard; Weigl, Bernhard H; Gaydos, C; Gaydos, C A; Gaydos, Charlotte; Gaydos, Charlotte A • After name disambiguation: Hilderbrand, Scott A.; Weigl, Bernhard; Gaydos, Charlotte. [Figure: co-author network before and after name disambiguation]

  29. Data cleaning: name disambiguation has three steps. Step 1: Automated assistance using the iSearch-Publications co-occurrence feature. Step 2: Manual check of merged names. Step 3: Manual check of unmerged names. These three steps are key for creating a valid and meaningful co-author network. Printed copies with a step-by-step description of how to do name disambiguation will be distributed during the class. The following slides give a brief description of this three-step process.

  30. Data cleaning: name disambiguation, Step 1. Automated assistance using the iSearch-Publications co-occurrence feature. 1. Go to https://itools.od.nih.gov/dashboard/ 2. Once you have your publications in iSearch, go to the export button and select “Co-occurrence Graph”. Please remember that all 3 disambiguation steps are key for creating a valid and meaningful co-author network.

  31. Data cleaning: name disambiguation, Step 1 (continued). 3. Output files • There are three output files under iSearch notifications (see the iSearch notifications icon): • Node file • Edge file • Docs file. Note that the extension of the files is .TSV. To open them in Excel, download the files and then look for them in the “Downloads” folder of your computer.

  32. Data cleaning: name disambiguation, Step 1 (continued). To open the .TSV files in Excel, open a new Excel spreadsheet and click and drag the .TSV file onto it. • Output files: Nodes file. This file has information about each node. It has the following columns: • Best Name: name for the node • Merged Names: other names for this node • PMIDs: the PubMed IDs this node occurred on • Grants: the grants linked to the publications this node occurred on • Clinical trials, patents, journals, earliest pub year, and latest pub year this node occurred on

  33. Data cleaning: name disambiguation, Step 1 (continued). • Output files: Edges file. This file shows the relationship between the entities. It has the following columns: • Source and target nodes: in co-author networks, the source and target nodes are interchangeable because co-author networks are undirected. • Weight: the number of times the source and target nodes co-occur. • PMIDs: the PMIDs in which the source and target nodes co-occur. Examples from the file: two authors who published together only once, in PMID 21642420, have an edge weight of 1; two authors who published together three times, in PMIDs 20096814, 20974641, and 21642420, have an edge weight of 3.
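
Because the weight of an edge should equal the number of PMIDs listed for it, an edges file can be sanity-checked with a short script. This is only a sketch: the header names (`Source`, `Target`, `Weight`, `PMIDs`), the `;` separator between PMIDs, and the author labels are assumptions, not the documented iSearch export format. The PMIDs are the ones quoted on the slide.

```python
import csv
import io

# Stand-in for an exported edges file (assumed layout, see the note above).
edges_tsv = (
    "Source\tTarget\tWeight\tPMIDs\n"
    "Author 1\tAuthor 2\t1\t21642420\n"
    "Author 1\tAuthor 3\t3\t20096814;20974641;21642420\n"
)

def check_edge_weights(handle, pmid_separator=";"):
    """Return the (source, target) pairs whose weight differs from their PMID count."""
    mismatches = []
    for row in csv.DictReader(handle, delimiter="\t"):
        pmids = [p.strip() for p in row["PMIDs"].split(pmid_separator) if p.strip()]
        if int(row["Weight"]) != len(pmids):
            mismatches.append((row["Source"], row["Target"]))
    return mismatches

mismatches = check_edge_weights(io.StringIO(edges_tsv))
print(mismatches)
```

To run this on a real download, replace `io.StringIO(edges_tsv)` with an opened file and adjust the column names and separator to match the actual export.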

  34. Data cleaning: name disambiguation, Step 1 (continued). • Output files: Docs file. This file has information about each publication. It has the following columns: • PMID • Original names (as they appear in the publications) • Clustered names, which are the already disambiguated names. Example: author “Lui, Julian” had two different name forms in the input file (original names column) and only one in the output file (clustered names column).

  35. Data cleaning: name disambiguation, Step 2. Manual check of merged names. • The nodes file has a column called “Merged Names”, which shows which names have been merged. • If any name was merged but should not have been, it has to be manually corrected in the “clustered names” column of the Docs file. After that, the nodes and edges files provided by iSearch can no longer be used, because they will not include any manual corrections. Instructions on how to create the nodes and edges files can be found at https://dpcpsi.nih.gov/sites/default/files/Data%20formatting-Step%20by%20step.pdf

  36. Data cleaning: name disambiguation, Step 3. Manual check of unmerged names. Part 1: List of Sci2 unique names from the Clustered Names column. Part 2: Excel pivot table showing all last name, first initial variations (Docs file, Clustered Authors column). Questions to ask for each variation: Typo? Same author?
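
The class does this check with Sci2 and an Excel pivot table. As an alternative illustration only, the sketch below groups names by last name and first initial so that possible variants of the same author are listed together for manual review. It assumes names are written as "Last, First" and uses name forms from the disambiguation background slide.

```python
from collections import defaultdict

def name_key(name):
    """Reduce 'Last, First M' to (last name, first initial), lowercased last name."""
    last, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].upper() if given else ""
    return (last.strip().lower(), initial)

def candidate_variants(names):
    """Group names sharing last name and first initial; keep groups with more than one form."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}

clustered_names = [
    "Hilderbrand, S",
    "Hilderbrand, Scott",
    "Hilderbrand, Scott A",
    "Weigl, Bernhard",
    "Gaydos, C",
    "Gaydos, Charlotte A",
]
variants = candidate_variants(clustered_names)
for key, forms in variants.items():
    print(key, forms)
```

The output is a list of candidates, not a decision: two people can share a last name and first initial, so each group still needs the manual "Typo? Same author?" check.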

  37. Data formatting

  38. Data formatting: using Sci2 to get the data ready for Cytoscape. Printed copies with a step-by-step description of how to get these files will be distributed during the class. The following two slides give a brief description of the data formatting process.

  39. Today’s example

  40. Co-author network: Tuberculosis. Stats from Cytoscape. *LCC: largest connected component. Disclaimer: this example is not part of any OPA analysis, and the data used has not been cleaned using Step 2 and Step 3 of name disambiguation. Therefore, it is only intended to be used as an example during this class and not for any other purpose.

  41. Co-author network: Tuberculosis. Analysis of the collaboration strength using the edge weights. [Figure panels: Tuberculosis network; largest connected component of the Tuberculosis network; Tuberculosis network showing only authors that published together 2 or more times. Labels: LCC 93% of nodes; LCC 21% of nodes; isolates 70% of nodes]
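
The class performs this analysis in Cytoscape. Purely as an illustration of the idea, the sketch below drops edges under a minimum weight and then reports the share of nodes in the largest connected component (LCC) and the share of isolates. The small weighted network is hypothetical and does not reproduce the Tuberculosis percentages.

```python
def lcc_and_isolate_share(nodes, weighted_edges, min_weight=1):
    """Return (LCC share, isolate share) after keeping edges with weight >= min_weight."""
    adjacency = {node: set() for node in nodes}
    for (source, target), weight in weighted_edges.items():
        if weight >= min_weight:
            adjacency[source].add(target)
            adjacency[target].add(source)

    seen = set()
    largest = 0
    for start in nodes:
        if start in seen:
            continue
        # Depth-first traversal to measure the size of this component.
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        largest = max(largest, size)

    isolates = sum(1 for node in nodes if not adjacency[node])
    return largest / len(nodes), isolates / len(nodes)

# Hypothetical co-author network: edge weight = number of joint publications.
authors = ["A", "B", "C", "D", "E"]
joint_pubs = {("A", "B"): 3, ("B", "C"): 1, ("C", "D"): 2, ("D", "E"): 1}

all_edges = lcc_and_isolate_share(authors, joint_pubs, min_weight=1)
repeat_only = lcc_and_isolate_share(authors, joint_pubs, min_weight=2)
print(all_edges, repeat_only)
```

Raising the threshold from 1 to 2 keeps only repeated collaborations, which is why the LCC shrinks and isolates appear.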

  42. Co-author network: Tuberculosis. Analysis of the collaboration among nodes with a degree of 50 or more. [Figure panels: Tuberculosis network; largest connected component of the Tuberculosis network; Tuberculosis network showing only authors that published with at least 50 other authors]
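
A sketch of the degree filter, again outside Cytoscape and on hypothetical data: compute each node's degree in the full network, keep the nodes at or above a threshold, and keep only the edges whose two endpoints both survive. The example uses a threshold of 2 in place of the 50 used in class so that the network stays small.

```python
def high_degree_subgraph(nodes, edges, min_degree):
    """Keep nodes with degree >= min_degree (in the full network) and the edges among them."""
    degree = {node: 0 for node in nodes}
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1

    kept_nodes = {node for node, count in degree.items() if count >= min_degree}
    kept_edges = [(source, target) for source, target in edges
                  if source in kept_nodes and target in kept_nodes]
    return kept_nodes, kept_edges

# Hypothetical network.
example_nodes = ["A", "B", "C", "D", "E"]
example_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]

kept_nodes, kept_edges = high_degree_subgraph(example_nodes, example_edges, min_degree=2)
print(sorted(kept_nodes), kept_edges)
```

Degree is measured before filtering, so a kept node can end up with fewer neighbors in the filtered view than its original degree suggests.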

  43. Co-author network: Malaria. Stats from Cytoscape. *LCC: largest connected component. Disclaimer: this example is not part of any OPA analysis, and the data used has not been cleaned using Step 2 and Step 3 of name disambiguation. Therefore, it is only intended to be used as an example during this class and not for any other purpose.

  44. Co-author network: Malaria. Analysis of the collaboration strength using the edge weights. [Figure panels: Malaria network; largest connected component of the Malaria network; Malaria network showing only authors that published together 2 or more times. Labels: LCC 96% of nodes; LCC 23.3% of nodes; LCC 21% of nodes; isolates 70% of nodes]

  45. Co-author network: Malaria. Analysis of the collaboration among nodes with a degree of 50 or more. [Figure panels: Malaria network; largest connected component of the Malaria network; Malaria network showing only authors that published with at least 50 other authors. Label: LCC 96% of nodes]

  46. Co-author network: Tuberculosis vs Malaria. Analysis of the collaboration strength using the edge weights. [Figure panels: Tuberculosis network showing only authors that published together 2 or more times; Malaria network showing only authors that published together 2 or more times. Labels: LCC 21% of nodes; LCC 23.3% of nodes; isolates 70% of nodes; isolates 71% of nodes] Disclaimer: these examples are not part of any OPA analysis, and the data used has not been cleaned using Step 2 and Step 3 of name disambiguation. Therefore, they are only intended to be used as examples during this class and not for any other purpose.

  47. Co-author network: Tuberculosis vs Malaria. Analysis of the collaboration among nodes with a degree of 50 or more. [Figure panels: Tuberculosis network showing only authors that published with at least 50 other authors; Malaria network showing only authors that published with at least 50 other authors] Disclaimer: these examples are not part of any OPA analysis, and the data used has not been cleaned using Step 2 and Step 3 of name disambiguation. Therefore, they are only intended to be used as examples during this class and not for any other purpose.

  48. Cytoscape demo

  49. Hands-on
