Tracking Code Changes with Motion Bubble Charts and GitSense Timeline Metrics

If you know D3 or have ever looked into it, you'll probably recognize the animated bubble chart below. It's from https://bost.ocks.org/mike/nations, which is a recreation of Hans Rosling's 2006 TED talk chart.

If you are not familiar with bubble charts, they are a type of chart that displays three dimensions of data. In the chart below, you have life expectancy (vertical axis), income (horizontal axis), and population (circle size). And as you can probably guess, a motion bubble chart is one that is composed of multiple bubble charts shown in sequence. You can also add dimensions to a bubble chart by varying circle colors, line thickness, and so on.

Wealth & Health of Nations

The motion bubble chart that Hans Rosling created is often praised for bringing data to life, and in this blog post, we'll see if we can do the same for GitLab's Community Edition and Enterprise Edition Git repos.

If you are wondering "Why GitLab?", the answer is simple: they have a very aggressive release schedule. They release on the 22nd of every month, which probably means a lot of churn, which should make for good analysis. But before we can start, we'll have to quickly explain what GitSense Timeline Metrics are.

GitSense Timeline Metrics

In our first blog post, we talked about the different indexing levels in GitSense and what each level meant. If you were to index to level 7 or 8, GitSense would produce what we call daily, weekly and monthly timelines. These timelines capture the exact state of a branch at specific points in time. By default, GitSense produces 62 daily, 53 weekly, and 60 monthly timelines, which can be used to instantaneously answer questions like:

On day 2016-03-29, how many source lines of code, not including comment or blank lines, were there in files x, y and z on branches a, b, and c?

and

In week 20 of 2015, how many lines were added, changed and deleted, not including comment or blank lines, for files x, y and z on branches a, b, and c?

and so forth.

Commits and Code Timelines

There are two types of timeline metrics: one based on commits and another based on code. For the code timeline metrics, it is important to note that they only track code changes and not committers. For example, if somebody changed the same line in ten different commits over the course of a week, the weekly code churn for that file would only be one and not ten, since only one line changed from the beginning of the week to the end.
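The difference between counting per commit and counting per timeline can be sketched in Python. This is only an illustration under our own assumptions, not how GitSense computes its metrics: the `changed_line_count` helper, the snapshot contents, and the rule of counting a changed block by its larger side are all made up for the example.

```python
import difflib


def changed_line_count(before, after):
    """Count lines that differ between two snapshots of a file.

    Hypothetical helper: a replaced, inserted, or deleted block is
    counted by the size of its larger side.
    """
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    churn = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            churn += max(i2 - i1, j2 - j1)
    return churn


# Snapshot of a made-up file at the start of the week.
start_of_week = ["def total(items):", "    return sum(items)"]

# Ten commits during the week, each rewriting the same second line.
snapshots = [start_of_week]
for n in range(1, 11):
    snapshots.append(["def total(items):", f"    return sum(items) + {n}"])
end_of_week = snapshots[-1]

# Commit-based view: compare each snapshot with the one before it.
per_commit_churn = sum(
    changed_line_count(a, b) for a, b in zip(snapshots, snapshots[1:])
)

# Code-based view: compare only the start and end of the week.
weekly_code_churn = changed_line_count(start_of_week, end_of_week)

print(per_commit_churn, weekly_code_churn)
```

Summing over the ten commits gives ten changed lines, while comparing the two ends of the week gives one, which matches the example above.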

GitSense Timeline Data

If you are interested in viewing the numbers that are used in this blog post, you can find links to the TSV files below. Be warned, however, that some of the TSV files contain over 200,000 rows.

 Branch Head | File                      | Size
-------------+---------------------------+------
 cac03f74    | gitlab-ce-5-4-stable.tar  | 7.6M
 e46b644a    | gitlab-ce-6-9-stable.tar  | 8.4M
 d321305c    | gitlab-ce-7-14-stable.tar | 13M
 e63f120e    | gitlab-ce-8-6-stable.tar  | 22M
 c4da1463    | gitlab-ce-8-7-stable.tar  | 28M
 0631a1e9    | gitlab-ce-master.tar      | 28M
 8c2b6a6e    | gitlab-ee-7-14-stable.tar | 14M
 e4df2ca3    | gitlab-ee-8-6-stable.tar  | 26M
 a9a68f67    | gitlab-ee-8-7-stable.tar  | 24M
 fb561a15    | gitlab-ee-master.tar      | 24M

Tracking GitLab-CE and GitLab-EE Code Changes from High Above

In this section, we'll go over some motion bubble charts that provide a high-level view into GitLab-CE and GitLab-EE. To keep things simple, we'll use programming language as the lowest common grouping. Note that since GitSense tracks everything at the file level, you can easily abstract the data in pretty much any way you want.

Code Changes by Programming Languages

Y-axis: Source lines of code, not including blank or comment lines
X-axis: Cumulative lines added, changed and deleted, not including blank or comment lines
Radius: Number of files
Branches: GitLab-CE (master), GitLab-CE (7-14-stable), GitLab-CE (6-9-stable), GitLab-CE (5-4-stable), GitLab-EE (master), GitLab-EE (7-14-stable-ee)

I think it's fair to say, the above chart doesn't quite have the same impact as Hans Rosling's Wealth & Health of Nations chart, but it does bring to attention some interesting facts, like:

  1. Around August 2012, there was a little supernova. If you do an aggregate search in the TSV files, you'll find that a bunch of files classified as "other" were deleted. Note that GitSense classifies files it does not recognize as "other".
  2. At the end of 2012, a bunch of HTML files were deleted. With a little digging in the TSV files, you'll find that they were GitLab HTML documents.
  3. GitLab Enterprise contains more code than the Community Edition. This really shouldn't come as a surprise though.
  4. GitLab produces a lot of code churn (lines added, changed and deleted) that does not involve comment or blank lines.

As stated earlier, we expected the churn to be high and it is. For the Ruby code in GitLab-EE master, it's more than 2 times the total lines of code. To get a better understanding of what's going on, we'll apply the following filters to the next chart:

  • Only consider changes to Ruby code.
  • Ignore code churn that was the result of adding or deleting Ruby files.

We'll also change the Y-axis to show same-line changes. With this, we'll be able to estimate how much of the churn was the result of modifying existing code versus creating or removing code.

Y-axis: Cumulative changes to the same line of code, not including blank or comment lines
X-axis: Cumulative lines added, changed and deleted, not including blank or comment lines
Radius: Fixed size across all branches
Branches: GitLab-CE (master), GitLab-CE (7-14-stable), GitLab-CE (6-9-stable), GitLab-CE (5-4-stable), GitLab-EE (master), GitLab-EE (7-14-stable-ee)

With the changes made, we can see the cumulative code churn dropped by about 100,000, which means about 100,000 lines of churn were the result of adding or deleting Ruby files.

The vertical axis, which tracks changes made to the same line, reaches about 40,000. For GitLab-EE, this means close to 180,000 lines of cumulative code churn came from adding new lines of code or deleting existing ones, not counting blank or comment lines.

The chart also shows the bubbles moving faster as they get closer to the present date, which would suggest the rate of churn is increasing. This would make sense, considering GitLab's recent fundraising and hiring efforts.

To better visualize the rate of change, we'll redefine the Y-axis to capture monthly code churn instead of cumulative churn, which makes the ups and downs easier to see. With this change, we can see that starting in 2015, the churn stayed consistently above 5,000 per month. However, there was also a large spike in GitLab-EE's master branch, which contributed to the faster bubble animation near the end.
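Going from a cumulative series to a monthly one is just a matter of taking the difference between consecutive values. Here is a minimal Python sketch of that step. The function name and the sample values are made up for illustration and are not GitLab's actual numbers.

```python
def monthly_from_cumulative(cumulative):
    """Turn a cumulative churn series into per-month churn.

    Each monthly value is the cumulative total for that month minus
    the cumulative total for the month before it.
    """
    monthly = []
    previous = 0
    for value in cumulative:
        monthly.append(value - previous)
        previous = value
    return monthly


# Made-up cumulative churn at the end of four consecutive months.
cumulative_churn = [4000, 9500, 15500, 30000]

monthly_churn = monthly_from_cumulative(cumulative_churn)
print(monthly_churn)  # [4000, 5500, 6000, 14500]
```

In a cumulative chart, a spike like the last month in this sample only shows up as a slightly steeper climb. Plotted monthly, it stands out immediately.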

Given what we've seen so far, it doesn't look like GitLab's development is slowing down by any stretch of the imagination. And it'll be interesting to see what things look like at the end of the year.

Tracking Code Changes at the File Level

What we've shown so far may be interesting, but it doesn't provide any actionable data or insight that could be used to improve a developer's day-to-day routine. Likewise, knowing how frequently somebody commits, and at what time, is interesting, but it hardly provides any insight into what has changed.

Developers work at the file level, and knowing that a lot of Ruby files have changed doesn't tell them much. To demonstrate how motion bubble charts can be used by developers, we'll use them to answer "What has changed in GitLab-CE in the last 7 days?"

Seven Days of Code Churn Grouped by Top-level Files and Directories

Y-axis: Cumulative lines added, changed, and deleted, not including blank or comment lines
X-axis: Root level directories and files
Radius: Number of changed files
Branches: GitLab-CE (master), GitLab-CE (8-7-stable), GitLab-CE (8-6-stable)

To help us answer the question of what has changed, we've created a motion bubble chart that groups the last seven days of code churn by top-level files and directories. And as in the high-level charts, we excluded code churn that was the result of adding or deleting files.
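Since the metrics are tracked per file, grouping by top-level file or directory comes down to taking the first component of each path. The Python sketch below shows the idea. It is not GitSense's implementation: the function name, the input shape (a mapping from file path to churn), the paths, and the numbers are all hypothetical.

```python
from collections import defaultdict


def churn_by_top_level(file_churn):
    """Group per-file churn by root-level directory or file.

    Returns, for each top-level name, the summed churn (the Y-axis)
    and the number of changed files (the radius).
    """
    totals = defaultdict(lambda: {"churn": 0, "files": 0})
    for path, churn in file_churn.items():
        top = path.split("/", 1)[0]  # "app/models/a.rb" -> "app"
        totals[top]["churn"] += churn
        totals[top]["files"] += 1
    return dict(totals)


# Hypothetical per-file churn for the last seven days.
sample = {
    "app/models/example.rb": 12,
    "app/views/example.html.haml": 3,
    "spec/models/example_spec.rb": 20,
    "Gemfile": 1,
}

for name, stats in sorted(churn_by_top_level(sample).items()):
    print(name, stats["churn"], stats["files"])
```

A root-level file such as `Gemfile` has no directory part, so it ends up in a group of its own.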

With this chart, we can see the app directory contains the most churn. We can also see the branch for the 8.6 release, which came out a few weeks ago, is still being worked on. And the release branch for 8.7, which is going to be released on the 22nd of April, 2016, has the most churn.

Since GitLab is predominantly written in Ruby, we'll apply a Ruby filter to the above chart to reduce the code churn noise from the other programming languages.

And by doing so, we can see it changed things drastically. Here the spec directory contains the most churn, and the app directory, which was first, dropped to third. So what's the reason for the big drop?

If you do an aggregate search in the code churn TSV files for GitLab-CE master, and only consider changes in the app directory from 2016-04-01 to now, you'll get the following code churn breakdown:

  type  | sum 
--------+-----
 ruby   | 308
 coffee | 377
 haml   | 488
 scss   | 942

which shows that the Ruby code had the least amount of churn compared to the non-Ruby files in that directory.
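To put the breakdown above in perspective, we can work out each type's share of the app directory's churn. The short Python sketch below uses only the four sums from the table. The variable names are ours.

```python
# Churn sums for the app directory, copied from the breakdown above.
app_churn = {"ruby": 308, "coffee": 377, "haml": 488, "scss": 942}

total = sum(app_churn.values())  # 2115

# Each type's share of the total, as a percentage.
shares = {kind: 100 * churn / total for kind, churn in app_churn.items()}

for kind, share in sorted(shares.items(), key=lambda item: item[1]):
    print(f"{kind:<6} {share:5.1f}%")
```

Ruby accounts for roughly 15% of the churn in the app directory, so applying the Ruby filter removes most of what was counted for that directory in the unfiltered chart.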

Something else that is interesting about the app directory is the ratio of churn to circle size. We can see the app circle, which represents the number of changed files, is about the same size as the spec circle, but the cumulative churn for the app directory is 1/3 less, which means the file changes in the app directory were most likely small ones.

To see if this was the case, we'll drill into the app directory. We'll also redefine the X and Y axes and the radius to make tracking individual files easier.

Y-axis: Source lines of code, not including blank or comment lines
X-axis: Cumulative lines added, changed, and deleted, not including blank or comment lines
Radius: Number of times the file has been modified

And as the above chart shows, most of the changes in the app directory were small ones, with most being 10 lines or less. We can also tell by the circle size for app/models/repository.rb that it was the most frequently changed Ruby file in the app directory.

To see if app/models/repository.rb is a heavily modified file or if the frequent changes in the last 7 days were an anomaly, we'll look at its last 60 days of daily churn.

In this chart of 60 days of daily code churn, we can see this file was modified quite frequently. We can also see that as we get closer to the present date, the amount of code churn decreases significantly, which would suggest this file is in maintenance mode.

If we look at the last 7 days of churn, we can see the last two changes were pretty small, which would suggest future changes will most likely be small as well, and that should make for easier code reviews.

Since this is an actively modified file, we'll create a new chart to track its change history in other release branches and to see who touched it last.

Branches: GitLab-CE (master), GitLab-CE (8-7-stable), GitLab-CE (8-6-stable), GitLab-EE (master), GitLab-EE (8-7-stable-ee), GitLab-EE (8-6-stable-ee)

And as this chart of the last 60 days of daily code churn shows, the app/models/repository.rb file is modified quite frequently across all active release branches, with the most churn in GitLab's Enterprise Edition.

Well, we hope you found this post both interesting and informative. And if you've never thought about using software metrics in your day-to-day routine, we hope this showed you how you can.

In the future, we'll cover other charting methods, like how you can use heat maps plus GitSense to quickly identify hotspots in your code's history, filter search results, and so much more. We are just scratching the surface of what we can do with our Timeline Metrics, so stay tuned for future posts to learn more.

© 2016 SDE Solutions, Inc. All rights reserved.