Social Network Analysis of Rietveld Subversion Repository

Recently I have been studying whether social network analysis (SNA) can be applied to Subversion repositories. I also wrote a blog post on it. I am using the sqlite3 database created with SVNPlot for the analysis.

The idea is to create social network graphs as follows:
  1. Authors and files are treated as 'graph nodes'.
  2. A link (or edge) is created between an author and each file that author added or edited.
  3. A link (or edge) is created between two files when they are edited as part of a single revision.
  4. The resulting graphs are analysed for degree centrality, closeness centrality, how the centrality changes over time, and clusters of files which are generally edited together.
This time too I am using the Rietveld project as an example. I am using NetworkX (for network analysis), Matplotlib (for generating graphs and outputting network diagrams), GraphViz (for graph layouts) and PyDot (for interfacing between NetworkX and GraphViz). It is amazing how quickly you can prototype something using excellent open source software like NetworkX and Matplotlib together with a language like Python.

Given below are the graphs and analyses I generated, along with my interpretation. Please note that I am not an expert on social network analysis (SNA), so if you find a mistake, please point it out in a comment and I will correct it.

Some background on the Rietveld project: it was started in May 2008, so data is available for 8-9 months. It is hosted on Google Code and written in Python. The analysis is based on data up to 25 Jan 2009 (i.e. up to revision 392).

Author Network Graphs:

First let's look at the Author Network Graphs. The assumptions are
  1. Authors and files are treated as nodes
  2. When an author edits/adds a file, a link/edge is created between the author and the file
  3. The number of edits (commits) is treated as the 'weight' of the link
  4. I have removed the edges with weight 1 to reduce the chances of errors.
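A minimal sketch of how these assumptions could be turned into a NetworkX graph. The commit records, the function name build_author_graph and the min_weight parameter are hypothetical and only for illustration; the real data would come from the SVNPlot sqlite3 database.

```python
from collections import Counter

import networkx as nx

# Hypothetical commit records: (revision, author, changed file path).
commits = [
    (1, "guido", "/trunk/codereview/views.py"),
    (2, "guido", "/trunk/codereview/views.py"),
    (3, "andi", "/trunk/templates/issue.html"),
    (4, "andi", "/trunk/templates/issue.html"),
    (5, "andi", "/trunk/codereview/views.py"),
]


def build_author_graph(commits, min_weight=2):
    """Build an author<->file graph; edge weight = number of edits."""
    edit_counts = Counter((author, path) for _rev, author, path in commits)
    graph = nx.Graph()
    for (author, path), count in edit_counts.items():
        # Edges with weight 1 are dropped, as in assumption 4.
        if count >= min_weight:
            graph.add_node(author, kind="author")
            graph.add_node(path, kind="file")
            graph.add_edge(author, path, weight=count)
    return graph


author_graph = build_author_graph(commits)
for author, path, data in author_graph.edges(data=True):
    print(author, path, data["weight"])
```

In this toy data the single edit by andi to views.py is filtered out, so only the two weight-2 edges remain.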
Results and Analysis
  1. Graph of Author's Network:
    NOTE: Clicking on the image above opens the same image in SVG format. The file nodes in the SVG image are 'clickable': clicking on a file node takes you to the latest version of that file in the Subversion repository.

    There are 3 principal authors: Guido, John Abdelmalek and Andi Albrecht. The length of an edge represents the weight of the edge. (The file names are not displayed to avoid clutter.)

    From the graph it seems that each author is working on a distinct set of files (or modules). A few files are edited by all three, but mainly each author edits 'his own' set of files. On a project this will help us identify (a) who the critical contributors are, (b) which modules they are mainly working on, and (c) which files are known only to these critical contributors.

  2. Graphs of Change in Author Centrality values over time

    Degree Centrality
    Closeness Centrality
    From the centrality graphs you can see that the 'centrality' of Guido is decreasing over time while the centrality of John and Andi is increasing. In a way, it shows that the dependency on a single person (Guido) is slowly reducing. The closeness centrality is around 0.5. Since closeness centrality is the reciprocal of the average shortest-path distance to the other nodes, this suggests that an author can find out more about a file within about 2 steps of contacting other authors (developers). A very low closeness centrality would also indicate problems.
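A rough sketch of how centrality over time could be computed: rebuild the graph from all commits up to a given revision and take the centrality values at each cut-off. The commit data and the helper name author_centrality_at are hypothetical, and for brevity the sketch ignores edge weights and the weight-1 filter.

```python
import networkx as nx

# Hypothetical commit records: (revision, author, changed file path).
commits = [
    (1, "guido", "a.py"),
    (2, "guido", "b.py"),
    (3, "andi", "b.py"),
    (4, "andi", "c.py"),
]


def author_centrality_at(commits, last_rev):
    """Return {author: (degree, closeness)} for commits up to last_rev."""
    graph = nx.Graph()
    authors = set()
    for rev, author, path in commits:
        if rev <= last_rev:
            graph.add_edge(author, path)
            authors.add(author)
    degree = nx.degree_centrality(graph)
    closeness = nx.closeness_centrality(graph)
    return {a: (degree[a], closeness[a]) for a in authors}


early = author_centrality_at(commits, last_rev=2)
late = author_centrality_at(commits, last_rev=4)
print("guido at r2:", early["guido"])
print("guido at r4:", late["guido"])
print("andi at r4:", late["andi"])
```

In this toy history the degree centrality of guido falls once andi starts committing, which is the kind of trend the centrality graphs show.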

File Network Graphs:

The assumptions are
  1. Files are treated as nodes
  2. A link/edge is created between files edited in the same revision.
  3. The number of revisions (commits) in which two files are edited together is treated as the 'weight' of the link
  4. I have removed the edges with weight 1 to reduce the chances of errors.
  5. Independent sub-graphs are extracted from the big graph.
  1. Independent Subgraphs and Their Minimum Spanning Trees of File Network:
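These assumptions could be sketched with NetworkX roughly as below. The revision data and the function name build_file_graph are hypothetical; I am also assuming that 'independent sub-graphs' means the connected components of the filtered graph.

```python
from collections import Counter
from itertools import combinations

import networkx as nx

# Hypothetical data: revision number -> files changed in that revision.
revisions = {
    1: ["views.py", "models.py"],
    2: ["views.py", "models.py", "urls.py"],
    3: ["views.py", "urls.py"],
    4: ["edit.html", "styles.css"],
    5: ["edit.html", "styles.css"],
    6: ["README", "views.py"],
}


def build_file_graph(revisions, min_weight=2):
    """Build a file<->file graph; weight = revisions shared by both files."""
    pair_counts = Counter()
    for files in revisions.values():
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    graph = nx.Graph()
    for (first, second), count in pair_counts.items():
        if count >= min_weight:  # drop edges with weight 1
            graph.add_edge(first, second, weight=count)
    return graph


file_graph = build_file_graph(revisions)
subgraphs = [
    file_graph.subgraph(nodes).copy()
    for nodes in nx.connected_components(file_graph)
]
for sub in subgraphs:
    print(sorted(sub.nodes()))
```

With this toy data README drops out (it was co-edited only once), and the remaining files split into two independent subgraphs.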

    Files represented by center nodes are modified more frequently and always with other files in the group. I think this information will help in refactoring as well as in identifying 'logical' dependencies between files. We can think of these files as 'critical files' in the group.


    Since the graph layout looks confusing because of too many edges, I have also created 'minimum spanning trees' (MSTs) of these graphs. An MST is much clearer to view. I have also added filenames to the MST images. The MST images give a better idea of how files are linked to each other, and it is easier to visually identify 'clusters'.
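A small sketch of extracting a spanning tree from one subgraph with NetworkX. The file names and weights are made up. One assumption that is mine and not stated above: I use the inverse of the co-edit count as the edge cost, so that the tree keeps the most strongly coupled pairs. Running minimum_spanning_tree directly on the raw weights would instead keep the weakest links.

```python
import networkx as nx

# Hypothetical subgraph: weight = number of revisions editing both files.
graph = nx.Graph()
graph.add_edge("views.py", "models.py", weight=5)
graph.add_edge("views.py", "urls.py", weight=4)
graph.add_edge("models.py", "urls.py", weight=2)
graph.add_edge("views.py", "issue.html", weight=3)

# Assumption: frequent co-edits should be "cheap" edges, so cost = 1/weight.
for first, second in graph.edges():
    graph[first][second]["cost"] = 1.0 / graph[first][second]["weight"]

mst = nx.minimum_spanning_tree(graph, weight="cost")
for first, second, data in mst.edges(data=True):
    print(first, second, data["weight"])
```

The tree keeps one edge fewer than the number of nodes, which is why it is so much easier to read than the full graph.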


    Nodes in blue color are center nodes.
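For illustration, center nodes and a centrality ranking for one connected subgraph could be obtained as in the sketch below. The graph is made up, and the choice of closeness centrality for the ranking is an assumption on my part. nx.center returns the nodes with the smallest eccentricity and requires a connected graph.

```python
import networkx as nx

# Hypothetical connected subgraph of co-edited files.
graph = nx.Graph()
graph.add_edges_from([
    ("views.py", "models.py"),
    ("views.py", "urls.py"),
    ("views.py", "issue.html"),
    ("issue.html", "styles.css"),
])

# Center nodes: smallest maximum distance to any other node.
center_nodes = nx.center(graph)

# Rank all files by closeness centrality, highest first.
closeness = nx.closeness_centrality(graph)
ranked = sorted(closeness.items(), key=lambda item: item[1], reverse=True)

print("center:", sorted(center_nodes))
for path, value in ranked:
    print("%s : %f" % (path, value))
```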

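     For each subgraph: graph of File<->File connections, minimum spanning tree of the file graph, and center nodes with their centrality. The center nodes and centrality values of each subgraph are listed below.
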
    1. /trunk/templates/edit.html : 0.304636
    2. /trunk/templates/diff2.html : 0.291139
    3. /trunk/templates/all.html : 0.289308
    4. /trunk/templates/patch.html : 0.289308
    5. /trunk/static/styles.css : 0.283951
    6. /trunk/templates/diff.html : 0.280488
    7. /trunk/templates/issue.html : 0.272189
    8. /trunk/codereview/urls.py : 0.264368
    9. /trunk/codereview/models.py : 0.258427
    10. /trunk/codereview/views.py : 0.235897
     

    1. /branches/testing/templates/mine.html : 0.326087
    2. /branches/testing/py_zipimport.py : 0.323741
    3. /branches/testing/make_release.sh : 0.323741
    4. /branches/testing/templates/mails/comment.txt : 0.321429
    5. /branches/testing/templates/mails/review.txt : 0.319149
    6. /branches/testing/index.yaml : 0.312500
    7. /branches/testing/codereview/library.py : 0.308219
    8. /branches/testing/templates/diff.html : 0.306122
    9. /branches/testing/codereview/middleware.py : 0.306122
    10. /branches/testing/settings.py : 0.304054
    11. /branches/testing/templates/diff2.html : 0.302013
    12. /branches/testing/codereview/urls.py : 0.302013
    13. /branches/testing/templates/edit.html : 0.300000
    14. /branches/testing/templates/base.html : 0.298013
    15. /branches/testing/templates/patch.html : 0.292208
    16. /branches/testing/main.py : 0.292208
    17. /branches/testing/templates/all.html : 0.292208
    18. /branches/testing/TODO : 0.292208
    19. /branches/testing/static/script.js : 0.290323
    20. /branches/testing/templates/inline_comment.html : 0.290323
    21. /branches/testing/codereview/models.py : 0.288462
    22. /branches/testing/templates/issue.html : 0.288462
    23. /branches/testing/codereview/engine.py : 0.288462
    24. /branches/testing/codereview/views.py : 0.288462
    25. /branches/testing/codereview/patching.py : 0.286624
    26. /branches/testing/static/upload.py : 0.286624
    27. /branches/testing/codereview/intra_region_diff.py : 0.286624
    28. /branches/testing/app.yaml : 0.277778
    29. /branches/testing/Makefile : 0.276074
    30. /branches/testing : 0.272727
     
     
    1. /branches/chromium/codereview/models.py : 0.343750
    2. /branches/chromium/codereview/views.py : 0.333333

  2. Treemap of Commit Count Vs Centrality
    I have generated a treemap of the subgraphs and file paths, showing the commit count and centrality values. Check it here