
Topology and Evolution of the Open Source Software Community



Presentation Transcript


  1. Topology and Evolution of the Open Source Software Community • Yongqin Gao • Advisors: Dr. Vincent W. Freeh, Dr. Kevin Bowyer • Supported in part by the National Science Foundation – Digital Science & Technology

  2. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusions

  3. Overview (about OSS) • What is OSS • Free to use, free to distribute • Unlimited user and usage • Source code available and modifiable • Potential advantages over commercial software • Higher quality • Faster development • Lower cost • Transparent

  4. Overview (about our research) • Our goal • Understanding the OSS phenomenon • Approach • SourceForge is the source of our empirical data • Modeling as a social network • Analysis of topological statistics • Use simulation to verify and validate the model

  5. Outline • Overview • Data collection • Network modeling • Topological statistical analysis • Simulations • Publications • Conclusions

  6. Data Collection: Monthly • Web crawler (scripts) • Python • Shell • AWK • Sed • Monthly • Since Jan 2001 • ProjectID • DeveloperID • Almost 2 million records • Relational database

     PROJ|DEVELOPER
     8001|dev348
     8001|dev8972
     8001|dev9922
     8002|dev27650
     8005|dev31351
     8006|dev12409
     8007|dev19935
     8007|dev4262
     8007|dev36711
     8008|dev8972
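A minimal sketch of how the PROJ|DEVELOPER records on this slide could be loaded into two lookup tables. The function name `load_memberships` and the in-memory string are assumptions made for illustration; the slides only say that the crawler output is stored in a relational database.

```python
from collections import defaultdict

# Sample rows copied from the slide; the pipe-delimited layout is taken as-is.
RECORDS = """\
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) as dicts of sets."""
    devs_of = defaultdict(set)
    projects_of = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header row.
        if not line or line.upper().startswith("PROJ"):
            continue
        project, developer = line.split("|")
        devs_of[project].add(developer)
        projects_of[developer].add(project)
    return dict(devs_of), dict(projects_of)


devs_of, projects_of = load_memberships(RECORDS)
print(sorted(devs_of["8001"]))         # developers on project 8001
print(sorted(projects_of["dev8972"]))  # projects that dev8972 belongs to
```

Keeping both directions of the mapping makes it cheap to build either the developer network or the project network later.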

  7. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusions

  8. Modeling as Collaboration Network • What is a collaboration network? • A social network representing collaborative relationships • Examples: movie actor network and scientist collaboration network • How the SourceForge collaboration network differs • Link detachment • Virtual collaboration • Voluntary • Global • Bipartite property of collaboration networks

  9. Collaboration network - bipartite Adapted from Newman, Strogatz and Watts, 2001
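The bipartite structure can be turned into the two one-mode networks the talk refers to (developers linked by shared projects, and projects linked by shared developers). The sketch below is one way to do that projection; the function name `one_mode_projection` and the small membership list are illustrative assumptions, not the author's code.

```python
from collections import defaultdict
from itertools import combinations

# (project, developer) pairs; a small hypothetical subset for illustration.
MEMBERSHIPS = [
    ("8001", "dev348"),
    ("8001", "dev8972"),
    ("8001", "dev9922"),
    ("8007", "dev19935"),
    ("8007", "dev4262"),
    ("8008", "dev8972"),
]


def one_mode_projection(pairs):
    """pairs are (group, member); link two members when they share a group."""
    members_of = defaultdict(set)
    for group, member in pairs:
        members_of[group].add(member)
    adjacency = defaultdict(set)
    for members in members_of.values():
        for member in members:
            adjacency[member]  # touch the key so isolated members stay as nodes
        for a, b in combinations(sorted(members), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return {node: set(nbrs) for node, nbrs in adjacency.items()}


# Developers are nodes, projects are links.
developer_net = one_mode_projection(MEMBERSHIPS)
# Dual view: projects are nodes, shared developers are links.
project_net = one_mode_projection([(dev, proj) for proj, dev in MEMBERSHIPS])
```

Swapping the order of each pair is all that is needed to obtain the dual network.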

  10. SourceForge Developer Network • [Figure: OSS Developer Network (Part)] • Developers are nodes / Projects are links • 24 Developers • 5 Projects (7597, 6882, 7028, 9859, 15850) • 2 hub Developers • 1 Cluster

  11. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusion

  12. Topological Analysis • Statistics inspected • Diameter • Average degree • Clustering coefficient • Degree distribution • Cluster size distribution • Relative size of major cluster • Fitness and life cycle • Evolution of these statistics • Dual networks • developer network and project network

  13. Terminology • Diameter: average length of the shortest paths between all pairs of vertices • Degree: the number of edges connected to a given vertex • Average degree: the mean degree over all vertices in the network • Cluster: a connected component of the network • Clustering coefficient (CC): CCi is the fraction of links actually present among the vertices in the neighborhood of vertex i, relative to the total possible number of such links; CC is the average of CCi over all vertices in the network • Degree distribution: the distribution of degrees throughout the network • Major cluster: the largest cluster in the network
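The definitions on this slide can be sketched in a few lines of Python for a network stored as a dict of neighbor sets. This is an exact, brute-force illustration on a toy graph; the talk mentions an approximate method for the diameter and CC on the real data, which is not reproduced here. Two choices are assumptions: vertices with fewer than two neighbors get CCi = 0, and the "diameter" averages only over pairs that are actually connected.

```python
from collections import deque

# Toy undirected graph (hypothetical): vertex -> set of neighbors.
GRAPH = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}


def average_degree(g):
    return sum(len(nbrs) for nbrs in g.values()) / len(g)


def local_cc(g, v):
    """Links present among v's neighbors divided by links possible."""
    nbrs = g[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases count as zero
    links = sum(1 for u in nbrs for w in g[u] if w in nbrs) // 2
    return links / (k * (k - 1) / 2)


def clustering_coefficient(g):
    return sum(local_cc(g, v) for v in g) / len(g)


def bfs_distances(g, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in g[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def clusters(g):
    """Connected components, largest (the major cluster) first."""
    seen, parts = set(), []
    for v in g:
        if v not in seen:
            part = set(bfs_distances(g, v))
            seen |= part
            parts.append(part)
    return sorted(parts, key=len, reverse=True)


def diameter(g):
    """Slide definition: mean shortest-path length over connected pairs."""
    total = pairs = 0
    for v in g:
        for u, d in bfs_distances(g, v).items():
            if u != v:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0
```

The relative size of the major cluster follows directly as `len(clusters(g)[0]) / len(g)`.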

  14. Diameter of Developer Network vs. Time • Network size increased from 30,000 to 70,000

  15. Diameter of Project Network vs. Time • Network size increased from 20,000 to 50,000. • Diameter decreasing with time both for developer network and project network

  16. Clustering Coefficient of Developer Network vs. Time

  17. Clustering Coefficient of Project Network vs. Time

  18. Degree Distribution (developers)

  19. Degree Distribution (projects)

  20. Cluster Size Distribution • R² with major cluster is 0.7426 • R² without major cluster is 0.9799
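One common way to obtain an R² for a power-law claim is an ordinary least-squares line through the points on log-log axes. The slides do not state how their R² values were computed, so the sketch below is an assumption about the method, and the numbers in it are synthetic rather than SourceForge data. It illustrates why a single very large cluster can pull the fit quality down.

```python
import math


def power_law_fit(xs, ys):
    """Fit log10(y) = slope * log10(x) + intercept; return (slope, prefactor, r2)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    syy = sum((y - mean_y) ** 2 for y in ly)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy * sxy) / (sxx * syy)  # squared correlation on log-log axes
    return slope, 10 ** intercept, r2


# Synthetic cluster sizes and counts following an exact power law.
sizes = [1, 2, 4, 8, 16]
counts = [1000 * s ** -2 for s in sizes]
slope, prefactor, r2_without = power_law_fit(sizes, counts)

# Add one giant cluster that sits far off the line (the "major cluster").
_, _, r2_with = power_law_fit(sizes + [1000], counts + [1])
print(slope, r2_without, r2_with)
```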

  21. Relative Size of Major Cluster vs. Time • The relative size of the major cluster increases over time • The rate of increase is slowing • May be an indication of the network evolution

  22. Existence of Fitness • Tracking the development of individual projects can verify the existence of the “newcomer” phenomenon • We tracked the development of every project that was new in July 2001 up to the present (1,660 projects in total) • The maximum monthly growth per project is 13, while the average monthly growth per project is just 0.3639

  23. Life Cycle of Project

  24. Summary

  25. Summary of Results • Power law rules • Degree distributions, cluster distribution • Average degree increasing with time • Diameter decreasing with time • Clustering coefficient decreasing with time • Fitness exists in SourceForge • Projects have life cycle behaviors

  26. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusion

  27. Conceptual Framework

  28. Agent-based Modeling • EBM vs. ABM • Heterogeneous individuals • Complex network • Experiment environment • Hardware: computer cluster • Software: • Simulation toolkit: Swarm • Database: Oracle • Languages: Java, PL/SQL

  29. Model for SourceForge • ABM based on bipartite graph • Model description • Agent: developer • Behaviors: Create, join, abandon and idle • Preference: developer’s and project’s • Fitness • Four models in iterations • ER, BA, BA with constant fitness and BA with dynamic fitness • Comparison of empirical and simulated data
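A toy sketch of the agent loop this slide describes: each developer agent chooses to create, join, abandon, or idle at every step. All probabilities, the arrival rate, and the `size + 1` join weight are hypothetical placeholders, and the code is plain Python rather than the Swarm/Java setup used in the study. Setting `preferential=False` gives uniformly random joins (in the spirit of the ER variant), while `preferential=True` favors larger projects (in the spirit of the BA variant).

```python
import random

ACTIONS = ("create", "join", "abandon", "idle")


def simulate(n_steps, new_devs_per_step=5, seed=0, preferential=True,
             weights=(0.1, 0.3, 0.05, 0.55)):
    """Return project id -> set of developer ids after n_steps rounds."""
    rng = random.Random(seed)
    projects = {}
    n_devs = 0
    for _ in range(n_steps):
        n_devs += new_devs_per_step  # assumption: steady developer arrival
        for dev in range(n_devs):
            action = rng.choices(ACTIONS, weights=weights)[0]
            if action == "create":
                # Projects are never deleted, so the count is a unique id.
                projects[len(projects)] = {dev}
            elif action == "join" and projects:
                ids = list(projects)
                if preferential:
                    # +1 keeps empty projects joinable.
                    join_w = [len(projects[p]) + 1 for p in ids]
                else:
                    join_w = [1] * len(ids)
                projects[rng.choices(ids, weights=join_w)[0]].add(dev)
            elif action == "abandon":
                mine = [p for p in projects if dev in projects[p]]
                if mine:
                    projects[rng.choice(mine)].discard(dev)
            # "idle": do nothing this step
    return projects


random_join = simulate(20, preferential=False)
preferential_join = simulate(20, preferential=True)
print(max(len(m) for m in random_join.values()),
      max(len(m) for m in preferential_join.values()))
```

The resulting project-to-developers mapping is the same bipartite structure as the empirical data, so the same topological statistics can be computed on both for comparison.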

  30. ER Model - Diameter • Average degree decreases in the simulation, whereas it increases in the empirical data • Diameter increases in the simulation, whereas it decreases in the empirical data

  31. ER Model – Clustering Coefficient • Clustering coefficient is relatively low (under 0.3), whereas it is around 0.7 in the empirical data.

  32. ER Model – Degree Distribution • Degree distribution is a normal distribution, whereas it is a power law in the empirical data

  33. ER Model – Cluster Size Distribution • Power-law fit has R² = 0.6667 (0.9653 without the major cluster), while R² in the empirical data is 0.7426 (0.9799 without the major cluster) • The actual distribution differs from the empirical data

  34. BA Model – Diameter and Clustering Coefficient • Small diameter and high clustering coefficient like empirical data • Diameter and clustering coefficient are both decreasing like empirical data

  35. BA Model – Degree Distribution • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For the developer distribution: simulated data has R² = 0.9798 and empirical data has R² = 0.9714. • For the project distribution: simulated data has R² = 0.6650 and empirical data has R² = 0.9838.

  36. BA Model with Constant Fitness • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For the developer distribution: simulated data has R² = 0.9742 and empirical data has R² = 0.9714. • For the project distribution: simulated data has R² = 0.7253 and empirical data has R² = 0.9838.

  37. BA Model with Dynamic Fitness • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For the developer distribution: simulated data has R² = 0.9695 and empirical data has R² = 0.9714. • For the project distribution: simulated data has R² = 0.8051 and empirical data has R² = 0.9838.

  38. Advantage of Dynamic Fitness • Intuition: fitness should decrease with time. • Statistics: projects show life cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
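To make the intuition concrete, the sketch below weights a project's attractiveness by its size times a fitness term that decays with age. The exponential form, the `decay` parameter, and the function name `join_weight` are assumptions chosen for illustration; the slides say only that fitness should decrease with time. With `decay=0` the weight reduces to a constant-fitness rule.

```python
import math


def join_weight(size, age, base_fitness=1.0, decay=0.0):
    """Preferential-attachment weight with an age-dependent fitness.

    size          current number of developers on the project
    age           time steps since the project was created
    base_fitness  fitness at creation (hypothetical parameter)
    decay         0 gives constant fitness; > 0 makes fitness shrink with age
    """
    fitness = base_fitness * math.exp(-decay * age)
    return (size + 1) * fitness


# Constant fitness: an old, large project always outweighs a newcomer.
old_constant = join_weight(size=99, age=24, decay=0.0)
new_constant = join_weight(size=0, age=0, decay=0.0)

# Dynamic fitness: the same old project fades, so the newcomer can win.
old_dynamic = join_weight(size=99, age=24, decay=0.5)
new_dynamic = join_weight(size=0, age=0, decay=0.5)

print(old_constant > new_constant, old_dynamic < new_dynamic)
```

Because an aging project's weight eventually falls below that of fresh projects, its growth slows and stops, which is the kind of life cycle a constant-fitness rule cannot produce.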

  39. Summary

  40. Summary of Results • We use ABM to model and simulate the SourceForge collaboration network. • A conceptual framework is proposed for agent-based modeling and simulation. • Case study of this framework: the SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.

  41. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusion

  42. Publications To-date • Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003. • Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh. • Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh. • Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.

  43. Possible Journals • Chapter 3 • Physica A: statistical mechanics and its applications • Journal of Social Structure (JSS) • Chapter 4 • Journal of Artificial Societies and Social Simulation (JASSS) • Journal of Statistical Computation and Simulation (JSCS)

  44. Outline • Overview • Data collection • Network modeling • Topological statistical analysis (real data) • Simulations • Publications • Conclusion

  45. Conclusion • Study of the SourceForge collaboration network can help us understand the OSS community • We investigate not only the topological statistics but also the evolution of these statistics. • Simulation is used to investigate the SourceForge collaboration network.

  46. Contribution • Statistical study of the SourceForge community (snapshot and evolution) • Verification of the approximate method to calculate the diameter and CC • Proposal of a model for the SourceForge community • Extension of the BA model with dynamic fitness

  47. Future Work • Data collection • Database dump from SourceForge (PostgreSQL 8GB) • All the possible attributes • Database schema in UML • More topology analysis (with more attributes) • Discussion forum • Task assignment • Project management • Active testing • Behavior-based analysis • Interaction between agents • H. Peyton Young’s model • Information entropy analysis

  48. Acknowledgements • Committee • Advisors • Colleagues • SourceForge • NSF • Others

  49. Thank you
