1. Mining Email Social Networks in OSS
Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz
Department of Computer Science
Anand Swaminathan
Graduate School of Management
University of California, Davis
Notes:
Get names right.
Me, project, joint work with.
2. Motivation
The social process is an important but hard-to-study aspect of any software engineering effort
It can be studied in many stable and mature OSS projects
Nearly all communication is done via the internet
Records of both communication and development activity are freely available
Notes:
Mention that incorporation of newcomers is important and needs to be understood.
Mention that the social process in traditional projects is hard to study.
Records of both development activity and communication are archived and freely available for most OSS projects.
(After first bullet) Incorporation of newcomers is vital to the success of an OSS project, so understanding it is valuable.
Hard to study in traditional projects.
Not all communication is available, but a large amount is.
We want to study the social process on the Apache mailing list.
3. Apache Communication and Development (since 1996)
100,000+ messages on dev mailing list
70,000 CVS commits to files
Notes:
Next transition: How do we make sense of this data? We hope to quantitatively evaluate some common beliefs about the social structure of OSS projects.
Enlarge labels and add years.
Correlate with major releases.
MEMORIZE: Our goal is to use this data to quantitatively evaluate existing hypotheses regarding the social structure of OSS projects.
 
4. It is widely believed that OSS communities form a hierarchy
Notes:
Either on the slide or in the talk, mention that this view is qualitative and would benefit from a quantitative analysis. Look for documenter mailing list.
Use SNA to put this diagram on a more formal, quantitative basis.
5. Social Networks
A network consisting of actors and their social ties to each other.
Notes:
Just say nodes are people and ties are dating relationships.
Some people are more connected and central than others.
Transition: this same SN formalism has been used to analyze OSS projects before.
6. Related Work
Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.
Crowston & Howison used co-occurrence of developers on a bug report as a social link.
Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.
We believe that responses to emails indicate a strong social link.
Notes:
Unfortunately, there are some hoops to jump through.
"Robe-layz": get the names right.
Mention that we get a much larger network because we don't include just devs.
7. Issues with Mailing List Analysis
Extracting conversation threads
Rationalizing timestamps
Identifying targets in a broadcast medium
Resolving email aliases
Extracting content
Notes:
Need to recreate message threads by looking at replies.
Need to deal with different time zones and remove messages where the clock wasn't set properly.
Talk about extracting/analyzing textual content of messages (it's hard) before aliasing (don't mention patches).
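A minimal sketch of the timestamp step in Python, assuming each message carries a standard `Date` header. The helper name `normalize_date` and the plausibility window (`EARLIEST`, `LATEST`) are illustrative assumptions, not details from the talk.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical plausibility window: messages dated outside it are treated
# as coming from a machine whose clock was not set properly.
EARLIEST = datetime(1995, 1, 1, tzinfo=timezone.utc)
LATEST = datetime(2007, 1, 1, tzinfo=timezone.utc)


def normalize_date(header_value):
    """Return the Date header as a UTC datetime, or None if unusable."""
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Unparseable header (the exception type depends on Python version).
        return None
    if parsed.tzinfo is None:
        # No usable zone information: assume UTC (an assumption of this sketch).
        parsed = parsed.replace(tzinfo=timezone.utc)
    parsed = parsed.astimezone(timezone.utc)
    if not (EARLIEST <= parsed <= LATEST):
        return None
    return parsed
```

Converting everything to UTC makes messages sent from different time zones directly comparable, and returning `None` lets the caller drop messages with implausible dates.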
 
8. Email Aliases
2,544 different email address aliases have been used on the Apache dev mailing list since 1996.
Many of these email addresses belong to the same people.
The following email addresses were all used by Joe Orton.
Notes:
Many of the most active developers use the most aliases.
Don't spend too much time on the example.
We just want to exploit the similarity of the emails.
9. Email Alias Analysis
1. Preprocess name and address.
   Remove commas (orton, joe -> joe orton)
   Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., jr., etc.)
   Remove common email terms (list, admin, root)
2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
   name-name: joe orton vs. joe e. orton
   email-email: jorton@foo.com vs. jorton@bar.org
   name-email: joe orton vs. jorton@foo.com
3. Manually post-process aliases marked as similar to remove the high level of false positives.
4. Use a similar process to map CVS accounts to email aliases.
Notes:
This is not an algorithm.
Preprocess by splitting around commas, removing whitespace and punctuation.
No need to explain edit distance.
We use edit distance and heuristics such as "chris bird" is "cbird" and "chrisb".
We use this to build clusters and manually post-process the clusters.
How many singletons were there?
Transition to talk about results.
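The preprocessing and fuzzy-matching steps above could be sketched roughly as follows. This is an illustration only: the prefix/suffix list, the 0.8 similarity threshold, and all function names are assumptions, and the single name-email heuristic shown (first initial plus last name) stands in for whatever set of heuristics was actually used. Matches found this way would still need the manual post-processing described in step 3.

```python
import re

PREFIXES_SUFFIXES = {"mr", "mrs", "ms", "dr", "jr", "sr"}  # assumed list
EMAIL_TERMS = {"list", "admin", "root"}                    # from the slide


def preprocess_name(name):
    """Normalize a display name: 'Orton, Joe' -> 'joe orton'."""
    if "," in name:
        last, _, first = name.partition(",")
        name = first + " " + last
    name = re.sub(r"[^\w\s]", " ", name.lower())  # punctuation -> space
    tokens = [t for t in name.split()
              if t not in PREFIXES_SUFFIXES and t not in EMAIL_TERMS]
    return " ".join(tokens)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def looks_like_same_person(name_a, email_a, name_b, email_b, threshold=0.8):
    """Flag a candidate alias pair for later manual review."""
    na, nb = preprocess_name(name_a), preprocess_name(name_b)
    ua, ub = email_a.split("@")[0].lower(), email_b.split("@")[0].lower()
    if na and nb and similarity(na, nb) >= threshold:  # name-name
        return True
    if similarity(ua, ub) >= threshold:                # email-email
        return True
    for name, user in ((na, ub), (nb, ua)):            # name-email
        parts = name.split()
        if len(parts) >= 2 and user == parts[0][0] + parts[-1]:
            return True
    return False
```

For example, `preprocess_name("Orton, Joe")` returns `"joe orton"`, and the pair ("Joe Orton", jorton@foo.com) and ("Joe E. Orton", jorton@bar.org) is flagged as a candidate match.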
10. Alias Results
2,544 email aliases used
2,008 unique identities used
Many of the high-volume participants had a large number of aliases
11. Creating the Email Social Network
Each email message has a message ID.
A response message contains an In-Reply-To header, which includes the message ID of the previous message.
If Joe posts a message and Bob responds, this indicates information flow, and we create a directed tie from Joe to Bob.
We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
Notes:
Show the Bob-Alice animation example here and talk through it. Should go faster.
Message IDs match, so there's a social network link.
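A rough sketch of how such a network could be assembled in Python. It is not the tool described on the slide: the message representation (dicts with `id`, `sender`, and `in_reply_to` keys), the `identity_of` alias map, and both function names are assumptions made for illustration. Restricting to a time period would amount to filtering the message list before calling it.

```python
from collections import defaultdict


def build_reply_network(messages, identity_of=None):
    """Count directed ties (original poster -> responder).

    messages: iterable of dicts with 'id', 'sender', optional 'in_reply_to'.
    identity_of: optional map from email alias to a resolved identity.
    """
    messages = list(messages)
    identity_of = identity_of or {}
    sender_by_id = {m["id"]: identity_of.get(m["sender"], m["sender"])
                    for m in messages}
    ties = defaultdict(int)
    for m in messages:
        parent = m.get("in_reply_to")
        if parent not in sender_by_id:
            continue  # not a reply, or the parent is not in the archive
        poster = sender_by_id[parent]
        responder = sender_by_id[m["id"]]
        if poster != responder:  # ignore replies to oneself
            ties[(poster, responder)] += 1
    return dict(ties)


def to_adjacency_matrix(ties):
    """Turn the tie counts into a valued adjacency matrix (list of rows)."""
    actors = sorted({actor for pair in ties for actor in pair})
    index = {actor: i for i, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for (src, dst), weight in ties.items():
        matrix[index[src]][index[dst]] = weight
    return actors, matrix
```

Resolving aliases before counting ties matters here: two replies sent from different addresses of the same person contribute to one valued tie rather than two separate ones.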
12. Intro to Social Network Metrics
In-degree: the number of links whose head is connected to a particular actor
Out-degree: the number of links whose tail is connected to a particular actor
Geodesic: a shortest path between two actors
Betweenness: the number of geodesics that a particular actor lies on
Notes:
On this slide, explain what betweenness means and why it's important in an SNA context.
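To make the definitions above concrete, here is a small Python sketch that computes them on an unweighted directed graph given as distinct (tail, head) pairs. The function names are illustrative. For betweenness it uses the common convention (an assumption here, since the slide only gives the informal definition) that when several geodesics connect a pair of actors, each contributes a fractional share; the computation follows Brandes' algorithm.

```python
from collections import deque


def degrees(edges):
    """Return (in_degree, out_degree) dicts for distinct (tail, head) pairs."""
    in_deg, out_deg = {}, {}
    for tail, head in edges:
        for node in (tail, head):
            in_deg.setdefault(node, 0)
            out_deg.setdefault(node, 0)
        out_deg[tail] += 1
        in_deg[head] += 1
    return in_deg, out_deg


def betweenness(edges):
    """Betweenness of every actor (Brandes' algorithm, unweighted, directed)."""
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, set()).add(head)
        successors.setdefault(head, set())
    score = {node: 0.0 for node in successors}
    for source in successors:
        # Breadth-first search that counts geodesics starting at source.
        order = []
        predecessors = {node: [] for node in successors}
        paths = {node: 0 for node in successors}
        paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in successors[node]:
                if nxt not in distance:
                    distance[nxt] = distance[node] + 1
                    queue.append(nxt)
                if distance[nxt] == distance[node] + 1:
                    paths[nxt] += paths[node]
                    predecessors[nxt].append(node)
        # Walk back from the farthest actors, sharing credit along geodesics.
        dependency = {node: 0.0 for node in successors}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = paths[pred] / paths[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                score[node] += dependency[node]
    return score
```

In a simple chain a -> b -> c, the only geodesic from a to c passes through b, so b is the only actor with non-zero betweenness. In-degree and out-degree look only at an actor's immediate links, whereas betweenness depends on the whole network.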
13.
14. Betweenness more formally
Notes:
Transition: to put this idea of betweenness in context on the Apache mailing list, it's useful to look at a picture of it.
15.
Notes:
Don't say all info flows through Ryan Bloom. The past isn't a complete predictor of the future.
Now, the complete social network is too big to show, but it's useful to look at some distribution data of the graphs.
16. The distributions of in-degree and out-degree both exhibit a power-law character
Notes:
What we have extracted is typical of an SN.
Now, we turn to the question: is there any difference between developers and non-developers in this social network?
Enlarge labels and state clearly that it's log-log.
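As a rough illustration of the log-log view, the sketch below fits a straight line to log10(frequency) against log10(degree) using ordinary least squares. A power law with exponent k shows up as a line of slope -k on such a plot. This is only a quick check under the assumption that a list of per-actor degrees is already available; it is not a rigorous power-law test and not necessarily how the plots in the talk were produced. `loglog_slope` is a hypothetical helper name.

```python
import math
from collections import Counter


def loglog_slope(degree_values):
    """Least-squares slope of log10(count) versus log10(degree).

    Actors with degree 0 are skipped because log10(0) is undefined.
    """
    counts = Counter(d for d in degree_values if d > 0)
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log10(d) for d in counts]
    ys = [math.log10(counts[d]) for d in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope that stays roughly constant across the range of degrees is what "power-law character" looks like on the log-log plot; strong curvature would argue against it.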
17. Status of Developers vs. Non-Developers
Notes:
Note that the largest discrepancy between devs and non-devs is found in the betweenness metric. This indicates that developers are gatekeepers or information brokers in the email network. In-degree and out-degree are local measures, whereas betweenness is a more global metric.
Transition: now, it's not just that developers are different from non-developers; there's actually a strong relationship between social network status and development activity.
18. Correlation between communication and development
Notes:
Drop the last three columns. Circle the correlations of interest. Divide SN metrics and dev metrics.
Next we can see that developers and non-developers can be distinguished by their degrees, right from the time they first appear on the email list.
Add arrows for the first and second bullets.
19. Observations from the network
The mailing list activity reflects a typical social network.
Developers are the key social brokers.
More active developers tend to be more important.
Results are robust: Postgres showed similar results.
Notes:
Active development -> important in social network
20. Topics of future research
Visualization of software and social data
Who becomes a developer?
Relationship between communication and collaboration networks
Network evolution
Conway's Law
Notes:
Who becomes a developer? What variables affect who becomes a developer and who doesn't (number of patches submitted, emails to core devs, betweenness, length of time on the mailing list, etc.)?
Who becomes a developer: we have new recent data that we'd be happy to share; ask me or my advisor.
Relationship between communication and collaboration networks: are defects more likely to occur if two people collaborate but don't communicate?
Network evolution: how do the networks change over time? What events cause or precede these changes?
Conway's Law
21. Average In-Degree
Notes:
Throw all pictures into the same slide.