Exploring the “social coding” behavior.
Social groups have a system of behaviors that occur within a group or between groups, called “group dynamics”. GitHub, being a social coding network, hosts millions of developers and software projects. This report uses openly available data to uncover such behaviors in a social network built around intellectual work.
The growth of GitHub has been significantly different from other OSS communities such as Alioth, Freecode, and Savannah. It has experienced an “explosive growth”, defined as the “outburst-type”. (Yu et al. 2014)
The following chart shows the growth of GitHub, Alioth, Freecode, and Savannah. It was made by approximating the data points in the figures embedded in the above-mentioned paper (using WebPlotDigitizer).
Notice that GitHub has experienced a big leap in a short time, whereas the growth of the other communities gradually slows down after a period of rising.
GitHub keeps promoting its “social coding” ideology: Be Social. It offers multiple social features, such as networking by following developers, sharing by forking repositories, and watching a project’s development.
The social features are the core factors behind this explosive growth of GitHub, setting it apart from other open source software communities.
The worst thing that can happen to your repository is that nobody sees it.
By the way, even worse is that the government will see it and ask you to stop working on it.
Whenever a new repository is created, an issue is filed, or a pull request is submitted, there are several channels that notify interested users about it.
A user who stars a repository can propagate it to their direct (first-degree) followers. If any of those followers stars the repository, it has a chance to appear in the activity feed of their followers, and so on.
If someone disconnected (without any connection to the creator) stars a repository, we can say that it has traversed far away in the network or gone viral.
External distribution channels have a significant impact on the popularity of a repository. They allow a repository to be viewed by more people compared to all the users it could have reached by the internal distribution channels of GitHub.
YCombinator’s Hacker News is a news aggregator website focusing on computer science, technology, and entrepreneurship, known for its frontpage blessings. It has a dedicated section called Show HN for sharing something that you’ve made.
According to Paul Graham, the frontpage works in a self-protecting way, advertising what type of submissions are expected.
Now, everyone wants to see their submission on the frontpage. So in GitHub’s context, they post about their repository on HN as soon as they open-source it.
For the following chart showing the submission delay (the difference between the time of a Hacker News submission and repository creation), I have collected submissions (linking to a GitHub repository) from the past 100 days.
The submission delay distribution has a long tail. There are 3261 HN stories linking to a GitHub repository in these 100 days; more than 33% of them (1078) were posted within a week of repository creation, and 15% within just 24 hours.
This suggests that people turn to external channels early, often right after open-sourcing a repository, to attract visitors.
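The submission delay described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the chart: the function names, the timestamp format, and the sample pairs are all assumptions made up for the example.

```python
from datetime import datetime, timedelta

# Assumed timestamp format (ISO 8601, UTC); the real data may differ.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def submission_delay(repo_created_at: str, hn_posted_at: str) -> timedelta:
    """Time between repository creation and the Hacker News submission."""
    created = datetime.strptime(repo_created_at, TIME_FORMAT)
    posted = datetime.strptime(hn_posted_at, TIME_FORMAT)
    return posted - created


def share_within(delays, limit: timedelta) -> float:
    """Fraction of delays that fall between zero and `limit` (inclusive)."""
    if not delays:
        return 0.0
    hits = sum(1 for d in delays if timedelta(0) <= d <= limit)
    return hits / len(delays)


# Hypothetical (repository created, HN submission) pairs, not real data.
sample = [
    ("2015-08-01T10:00:00Z", "2015-08-01T18:30:00Z"),  # same day
    ("2015-08-01T10:00:00Z", "2015-08-04T10:00:00Z"),  # three days later
    ("2015-07-01T00:00:00Z", "2015-07-20T00:00:00Z"),  # nineteen days later
    ("2015-01-01T00:00:00Z", "2015-08-01T00:00:00Z"),  # long tail
]
delays = [submission_delay(created, posted) for created, posted in sample]

day_share = share_within(delays, timedelta(days=1))
week_share = share_within(delays, timedelta(weeks=1))

# The share quoted in the text: 1078 of 3261 stories within the first week.
reported_week_share = 1078 / 3261
```

On the made-up sample, one of the four submissions falls within a day and two fall within a week; `reported_week_share` reproduces the "more than 33%" figure from the counts quoted in the text.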
Let’s consider a recent trending repository minimaxir/big-list-of-naughty-strings, to compare the effect of external and internal distribution channels.
In the following chart, you can see the stars it has received every day since its creation.
You can easily notice 4 local maxima, each corresponding to a Hacker News or Reddit submission featuring this repository. It received relatively few stars on the other days, except the day it was created.
If you’re wondering why it has almost 1000 stars on day 1, before the first HN submission, keep reading.
For this repository, at least, external channels had more impact than the internal distribution channels.
Now, let’s consider another repository, pravj/Doga. I’m fortunate enough to have written it. You can explore how Hacker News contributed to its small success.
I’m choosing this repository because it has relatively few stargazers, precisely 229 at the time of data collection, so it’s easy to visualize them. It was still a little hard to make the visualization responsive, given that this is my first production-level encounter with D3.
The following interactive visualization shows the “stargazer network” of this repository. You can hover over a node (user) to see the stargazers connected to it.
The “stargazer network” is a directed graph of all stargazers of the repository.
There exists an edge (Ui → Uj) in the network if Ui is following Uj and has starred the repository after Uj starred (or created) it.
The size of a node in a row is proportional to its in-degree, the number of users it might have told about the repository by starring it.
Users with a high in-degree are the ones most likely to have spread the news about the repository inside the GitHub network. A user’s in-degree is bounded by their number of GitHub followers, so well-followed users have a relatively high chance of doing this. The repository creator did the same on his own, spreading it to a few users before the HN submission.
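The edge rule and the in-degree above can be sketched as follows. This is an illustrative sketch rather than the code behind the visualization: the function names, the input dictionaries, and the four sample users are assumptions, and star times are plain integers for simplicity.

```python
def build_stargazer_network(starred_at, following):
    """Return the set of directed edges (ui, uj) of the stargazer network.

    starred_at: dict mapping each user to the time they starred the
                repository (for the creator, the creation time).
    following:  dict mapping each user to the set of users they follow.
    An edge (ui, uj) exists if ui follows uj and ui starred after uj.
    """
    edges = set()
    for ui, time_i in starred_at.items():
        for uj in following.get(ui, set()):
            if uj in starred_at and starred_at[uj] < time_i:
                edges.add((ui, uj))
    return edges


def in_degrees(edges, users):
    """Count, for each user, how many later stargazers follow them."""
    degrees = {user: 0 for user in users}
    for _, uj in edges:
        degrees[uj] += 1
    return degrees


# Hypothetical data: alice created the repository at time 0.
starred_at = {"alice": 0, "bob": 1, "carol": 2, "dave": 3}
following = {
    "bob": {"alice"},
    "carol": {"alice", "bob"},
    "alice": {"carol"},  # alice follows carol, but carol starred later
}

edges = build_stargazer_network(starred_at, following)
degrees = in_degrees(edges, starred_at)
```

In this toy example alice ends up with in-degree 2 and bob with in-degree 1. The follow from alice to carol produces no edge, because carol starred the repository after alice created it.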
According to WSJ Graphics’ Anatomy of a hit, audience size is a key factor in the success of open source software. That’s why minimaxir was able to spread his repository big-list-of-naughty-strings to more people before the first HN submission: he has more followers than me (@pravj).
We can label the stargazers at network distance N as disconnected stargazers: they follow no earlier stargazer, so their out-degree is 0. There are 139 such stargazers in total.
Disconnected stargazers at network distance N don’t have any successors, which means they discovered the repository either through GitHub search or somewhere outside the GitHub network.
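Finding the disconnected stargazers amounts to collecting the users with no outgoing edge. The sketch below assumes the edges are stored as (follower, followed) pairs and that the repository creator is left out of the stargazer list; both choices, along with the names and sample data, are assumptions for illustration.

```python
def disconnected_stargazers(edges, stargazers):
    """Return the stargazers with out-degree 0, sorted by name.

    edges:      iterable of (ui, uj) pairs, meaning ui follows uj and
                starred the repository after uj.
    stargazers: iterable of users who starred the repository
                (assumed here to exclude the repository creator).
    """
    users_with_out_edges = {ui for ui, _ in edges}
    return sorted(u for u in stargazers if u not in users_with_out_edges)


# Hypothetical network: alice is the creator, so she is not listed below.
edges = {("bob", "alice"), ("carol", "alice"), ("carol", "bob")}
stargazers = ["bob", "carol", "dave", "erin"]

disconnected = disconnected_stargazers(edges, stargazers)
```

Here dave and erin follow nobody who starred before them, so they are the disconnected stargazers of the toy network.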
GitHub models user actions on the platform as events of various types. Depending on its type, an event may or may not appear in the public activity timeline.
MemberEvent is one such event, triggered when a user is added or removed as a collaborator to a repository or has their permissions changed. (GitHub API v3)
Whenever a user Ui adds another user Uj as a collaborator to the repository R, all the followers of Uj will receive it in their activity feed.
Let’s relate it to “Influence Maximization” (Kempe et al. 2003), a widely studied problem in social networks. The goal is to find a small subset of users (seed users) whose adoption of a new idea or product triggers a cascade of further adoptions via social influence, such that the total number of adoptions is maximized.
As a workaround for this problem, some GitHub users used to add the most-followed users as collaborators, because doing so would broadcast the repository to all of their followers and eventually bring it more viewers.
A user Ui adding another user Uj to the repository R as a collaborator is a “collaboration spam” if Uj has zero commits in the repository and Uj does not follow Ui.
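The definition above translates directly into a predicate. This is a minimal sketch under assumed inputs: the function name, the dictionary shapes, and the sample users are hypothetical, and it does not describe how the data for the chart was actually collected.

```python
def is_collaboration_spam(adder, added, commit_counts, following):
    """Check whether `adder` adding `added` as a collaborator is spam.

    commit_counts: dict mapping a user to their number of commits in
                   the repository (missing users count as zero).
    following:     dict mapping a user to the set of users they follow.
    The addition is spam if `added` has zero commits in the repository
    and `added` does not follow `adder`.
    """
    has_no_commits = commit_counts.get(added, 0) == 0
    follows_adder = adder in following.get(added, set())
    return has_no_commits and not follows_adder


# Hypothetical repository owned by alice.
commit_counts = {"alice": 42, "bob": 3}
following = {"bob": {"alice"}, "dan": {"alice"}, "celebrity": set()}

spam = is_collaboration_spam("alice", "celebrity", commit_counts, following)
real_contributor = is_collaboration_spam("alice", "bob", commit_counts, following)
friendly_follower = is_collaboration_spam("alice", "dan", commit_counts, following)
```

Only the first case counts as spam: bob has commits in the repository, and dan, although he has none, follows alice.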
The following chart shows how many times some popular users were the target of a collaboration spam (in a year).
You can feel how annoying it would have been for the popular users, especially for the most-followed Linus Torvalds.
The ‘repository invitation’ feature has reduced the ‘collaboration spam’ on GitHub significantly.
That’s my exploration of the information dynamics on GitHub.
Well, I do have some unanswered questions. For now, I will keep them secret. Also, I’m envious of the data people at GitHub, because they have access to much more data than I can only dream of having.
I am thinking of collecting additional data in some other ways, though I don’t have the format ready yet; it’s just a thought. You can keep an eye on my GitHub for updates.
The source code for this report is available at pravj/github-dynamics.