
Analyzing C# Code on GitHub with BigQuery

13 Oct 2017

About a year ago, Google made all the open source code on GitHub available within BigQuery and, as if that wasn’t enough, you can run a terabyte of queries each month for free!

So in this post, I am going to look at all the C# source code on GitHub and see what we can find out from it. Handily, a smaller, C#-only dataset has been made available, which matters because in BigQuery you are charged per byte read. It’s called fh-bigquery:github_extracts.contents_net_cs and has:

  • 5,885,933 unique ‘.cs’ files
  • 792,166,632 lines of code (LOC)
  • 37.17 GB (37,174,783,891 bytes) of data

So a pretty comprehensive set of C# source files!

The rest of this post will attempt to answer the following questions:

  1. Tabs or Spaces?
  2. regions: ‘should be banned’ or ‘okay in some cases’?
  3. ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
  4. Do C# developers like writing functional code?

Then moving onto some less controversial C# topics:

  1. Which using statements are most widely used?
  2. What NuGet packages are most often included in a .NET project?
  3. How many lines of code (LOC) are in a typical C# file?
  4. What is the most widely thrown Exception?
  5. ‘async/await all the things’ or not?
  6. Do C# developers like using the var keyword?

Before we end up looking at repositories, not just individual C# files:

  1. What is the most popular repository with C# code in it?
  2. Just how many files should you have in a repository?
  3. What are the most popular C# class names?
  4. ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?

If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss some edge cases; after all, ‘Regular Expressions: Now You Have Two Problems’!!

Tabs or Spaces?

In the entire data-set, there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space:

| Tabs | Tabs % | Spaces | Spaces % | Total |
|---|---|---|---|---|
| 799,055 | 17.15% | 3,859,528 | 82.85% | 4,658,583 |

Clearly, C# developers (on GitHub) prefer spaces over tabs. Let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses spaces by default.)
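As a rough illustration of the rule described above, the sketch below classifies a single file by counting the lines that start with a tab or a space. It is written in Python rather than BigQuery SQL and is not the query behind the table. The function name `indentation_style`, the `min_lines` parameter and the majority-wins rule for mixed files are all assumptions made for this example.

```python
def indentation_style(content, min_lines=10):
    """Classify a file as 'tabs' or 'spaces', or None if it has too few indented lines.

    Assumption: a file counts towards whichever character starts more of its
    lines; the original query may treat mixed files differently.
    """
    tabs = spaces = 0
    for line in content.splitlines():
        if line.startswith("\t"):
            tabs += 1
        elif line.startswith(" "):
            spaces += 1
    # Only files with more than `min_lines` indented lines are considered.
    if tabs + spaces <= min_lines:
        return None
    return "tabs" if tabs > spaces else "spaces"
```

Files that return `None` would simply be left out of the totals, which is why the table covers fewer files than the full dataset.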

If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?

regions: ‘should be banned’ or ‘okay in some cases’?

It turns out that there are an impressive 712,498 C# files (out of 5.9 million) that contain at least one #region statement (query used). That’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
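For anyone wanting to check a file locally, here is a minimal Python sketch of such a test, together with the percentage arithmetic. The pattern and the name `has_region` are assumptions for illustration and may not match the regular expression in the linked query.

```python
import re

# Matches a line whose first non-blank token is the #region directive.
# "#endregion" does not match, because "region" must directly follow the '#'.
REGION = re.compile(r"^[ \t]*#[ \t]*region\b", re.MULTILINE)

def has_region(content):
    """True if the file content contains at least one #region directive."""
    return REGION.search(content) is not None

# Share of files reported above: 712,498 out of 5,885,933.
share = 712_498 / 5_885_933 * 100
```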

‘K&R’ or ‘Allman’, Where Do C# devs Like to Put their Braces?

C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):

| separate line | same line | same line (initializer) | total (with brace) | total (all code) |
|---|---|---|---|---|
| 81,306,320 (67%) | 40,044,603 (33%) | 3,631,947 (2.99%) | 121,350,923 (15.32%) | 792,166,632 |

(‘same line initializers’ include code like new { Name = "", .. }, new [] { 1, 2, 3.. })
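To make the three categories concrete, the following Python sketch counts brace placements line by line. It is only a heuristic under stated assumptions, not the query used for the table: a line consisting solely of `{` is "separate", any other line containing `{` is "same", and a "same" line where `new` appears before the brace is also counted as an "initializer". The function name `brace_styles` is hypothetical.

```python
import re

# 'new' followed, on the same line, by an opening brace (assumed initializer form).
INITIALIZER = re.compile(r"\bnew\b[^{;]*\{")

def brace_styles(content):
    """Count opening-brace placements in one file using simple line heuristics."""
    counts = {"separate": 0, "same": 0, "initializer": 0}
    for line in content.splitlines():
        stripped = line.strip()
        if stripped == "{":
            counts["separate"] += 1
        elif "{" in stripped:
            counts["same"] += 1
            # Initializers are a subset of the 'same line' count.
            if INITIALIZER.search(stripped):
                counts["initializer"] += 1
    return counts
```

A heuristic like this will also count braces inside strings and comments, so the results are approximate at best.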

Do C# Developers Like Writing Functional Code?

This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET, you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.

Here are the raw percentiles:

| Percentile | % of lines using lambdas |
|---|---|
| 10 | 0.51 |
| 25 | 1.14 |
| 50 | 2.50 |
| 75 | 5.26 |
| 90 | 9.95 |
| 95 | 14.29 |
| 99 | 28.00 |

So we can say that:

  • 50% of all C# files on GitHub use => on 2.50% (or less) of their lines
  • 10% of all C# files have lambdas on almost 1 in 10 of their lines
  • 5% use => on 1 in 7 lines (14.29%)
  • 1% of files have lambdas on over 1 in 4 (28%) of their lines of code, that’s pretty impressive!
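The percentile figures above can be reproduced in spirit with a small Python sketch: compute, per file, the percentage of lines containing `=>`, then take percentiles across files. The helper names, the choice to ignore blank lines and the nearest-rank percentile definition are assumptions here; BigQuery's own quantile functions may interpolate differently. Note also that `=>` appears in expression-bodied members as well as lambdas, so this overcounts slightly.

```python
import math

def lambda_line_percentage(content):
    """Percentage of non-blank lines in one file that contain '=>'."""
    lines = [line for line in content.splitlines() if line.strip()]
    if not lines:
        return 0.0
    hits = sum("=>" in line for line in lines)
    return 100.0 * hits / len(lines)

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Calling `percentile(per_file_percentages, 50)` on a list of per-file results would then give the median row of the table.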

Which using Statements are Most Widely Used?

Now on to something a bit more substantial: what are the most widely used using statements in C# code?

The top 10 look like this (the full results are available):

| using statement | count |
|---|---|
| using System.Collections.Generic; | 1,780,646 |
| using System; | 1,477,019 |
| using System.Linq; | 1,319,830 |
| using System.Text; | 902,165 |
| using System.Threading.Tasks; | 628,195 |
| using System.Runtime.InteropServices; | 431,867 |
| using System.IO; | 407,848 |
| using System.Runtime.CompilerServices; | 338,686 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |

However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.

So if we adjust the list to take account of this, the top 10 looks like:

| using statement | count |
|---|---|
| using System.IO; | 407,848 |
| using System.Collections; | 289,867 |
| using System.Reflection; | 218,369 |
| using System.Diagnostics; | 201,341 |
| using System.Threading; | 179,168 |
| using System.ComponentModel; | 160,681 |
| using System.Web; | 160,323 |
| using System.Windows.Forms; | 137,003 |
| using System.Globalization; | 132,113 |
| using System.Drawing; | 127,033 |

Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:

| using statement | count |
|---|---|
| using NUnit.Framework; | 119,463 |
| using UnityEngine; | 117,673 |
| using Xunit; | 99,099 |
| using System.Web.Mvc; | 87,622 |
| using Newtonsoft.Json; | 81,675 |
| using Newtonsoft.Json.Linq; | 29,416 |
| using Moq; | 23,546 |
| using UnityEngine.UI; | 20,355 |
| using UnityEditor; | 19,937 |
| using Amazon.Runtime; | 18,941 |
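The counting and filtering behind lists like these can be sketched in a few lines of Python. This is an illustration only: the pattern, the function name `count_usings` and the `exclude_prefixes` parameter are assumptions, and whether the original query counts a namespace once per file or once per occurrence is not stated. A plain prefix filter like this one would also drop `System.Web.Mvc`, which does appear in the last list, so the original filter was evidently a little different.

```python
import re
from collections import Counter

# A using directive at the start of a line, e.g. "using System.Linq;".
# Assumption: alias and static forms ("using X = ...", "using static ...") are ignored.
USING = re.compile(r"^[ \t]*using[ \t]+([A-Za-z_][\w.]*)[ \t]*;", re.MULTILINE)

def count_usings(files, exclude_prefixes=()):
    """Count, per namespace, how many files import it with a using directive."""
    counts = Counter()
    for content in files:
        # Count each namespace at most once per file.
        for namespace in set(USING.findall(content)):
            if not namespace.startswith(tuple(exclude_prefixes)):
                counts[namespace] += 1
    return counts
```

For example, `count_usings(files, exclude_prefixes=("System", "Microsoft", "Windows"))` would leave only third-party namespaces such as `NUnit.Framework`.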

What NuGet Packages are Most Often Included in a .NET Project?

It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub. It’s called contents_net_packages_config and has 104,808 entries. By querying this, we can see that Json.Net is the clear winner!!

| package | count |
|---|---|
| Newtonsoft.Json | 45,055 |
| Microsoft.Web.Infrastructure | 16,022 |
| Microsoft.AspNet.Razor | 15,109 |
| Microsoft.AspNet.WebPages | 14,495 |
| Microsoft.AspNet.Mvc | 14,236 |
| EntityFramework | 14,191 |
| Microsoft.AspNet.WebApi.Client | 13,480 |
| Microsoft.AspNet.WebApi.Core | 12,210 |
| Microsoft.Net.Http | 11,625 |
| jQuery | 10,646 |
| Microsoft.Bcl.Build | 10,641 |
| Microsoft.Bcl | 10,349 |
| NUnit | 10,341 |
| Owin | 9,681 |
| Microsoft.Owin | 9,202 |
| Microsoft.AspNet.WebApi.WebHost | 9,007 |
| WebGrease | 8,743 |
| Microsoft.AspNet.Web.Optimization | 8,721 |
| Microsoft.AspNet.WebApi | 8,179 |
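A ‘packages.config’ file is a small XML document with one `<package id="..." version="..." />` element per dependency, so a count like the one above can be sketched by parsing each file and tallying the `id` attributes. The Python below is an assumption-laden illustration (the name `count_packages` is hypothetical, and the original analysis may well have used a regular expression in BigQuery instead of an XML parser).

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_packages(config_files):
    """Count how many packages.config documents reference each package id."""
    counts = Counter()
    for xml_text in config_files:
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            continue  # skip malformed files rather than failing the whole run
        # A set ensures each package is counted once per file.
        ids = {p.get("id") for p in root.iter("package") if p.get("id")}
        counts.update(ids)
    return counts
```

`count_packages(contents).most_common(20)` would then produce a ranking in the same shape as the table.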

How Many Lines of Code (LOC) are in a Typical C# File?

Are C# developers prone to creating huge files that go on for 1000s of lines? Well some are, but fortunately it’s the minority of us!!

[Figure: Percentiles of lines of code per file]

Note the Y-axis is ‘lines of code’ and is logarithmic; the raw data is available.

Oh dear, Uncle Bob isn’t going to be happy: whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:

[Figure: Uncle Bob - Clean Code - Number of lines of code in a file]

And in case you’re wondering, here’s the Top 10 longest C# files!!

| File | Lines |
|---|---|
| MarMot/Input/test.marmot.cs | 92663 |
| src/CodenameGenerator/WordRepos/LastNamesRepository.cs | 88810 |
| cs_inputtest/cs_02_7000.cs | 63004 |
| cs_inputtest/cs_02_6000.cs | 54004 |
| src/ML NET20/Utility/UserName.cs | 52014 |
| MWBS/Dictionary/DefaultWordDictionary.cs | 48912 |
| Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs | 48407 |
| UrduProofReader/UrduLibs/Utils.cs | 48255 |
| cs_inputtest/cs_02_5000.cs | 45004 |
| css/style.cs | 44366 |

What is the Most Widely Thrown Exception?

There are a few interesting results in this query. For instance, who knew that so many ApplicationExceptions were thrown? And NotSupportedException being so high up the list is a bit worrying!!

| Exception | count |
|---|---|
| throw new ArgumentNullException | 699,526 |
| throw new ArgumentException | 361,616 |
| throw new NotImplementedException | 340,361 |
| throw new InvalidOperationException | 260,792 |
| throw new ArgumentOutOfRangeException | 160,640 |
| throw new NotSupportedException | 110,019 |
| throw new HttpResponseException | 74,498 |
| throw new ValidationException | 35,615 |
| throw new ObjectDisposedException | 31,129 |
| throw new ApplicationException | 30,849 |
| throw new UnauthorizedException | 21,133 |
| throw new FormatException | 19,510 |
| throw new SerializationException | 17,884 |
| throw new IOException | 15,779 |
| throw new IndexOutOfRangeException | 14,778 |
| throw new NullReferenceException | 12,372 |
| throw new InvalidDataException | 12,260 |
| throw new ApiException | 11,660 |
| throw new InvalidCastException | 10,510 |
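A tally like this comes down to extracting the type name that follows `throw new`. The Python sketch below shows one way to do that; the pattern and the name `count_thrown` are assumptions, and it counts every occurrence rather than once per file, which may or may not match the original query. A namespace-qualified throw such as `throw new System.ArgumentException(...)` would be recorded under `System` by this simple pattern.

```python
import re
from collections import Counter

# Captures the identifier immediately after "throw new".
THROW = re.compile(r"\bthrow\s+new\s+([A-Za-z_]\w*)")

def count_thrown(files):
    """Count occurrences of each exception type created in a throw statement."""
    counts = Counter()
    for content in files:
        counts.update(THROW.findall(content))
    return counts
```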

‘async/await All the Things’ or Not?

The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:

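// Needs: using System.Net.Http; using System.Text.RegularExpressions; using System.Threading.Tasks;
// _httpClient is assumed to be an HttpClient field on the containing class.
public async Task<int> GetDotNetCountAsync()
{
    // Suspends GetDotNetCountAsync() to allow the caller (the web server)
    // to accept another request, rather than blocking on this one.
    var html = await _httpClient.GetStringAsync("http://dotnetfoundation.org");

    // Escape the dot so that only the literal text ".NET" is counted.
    return Regex.Matches(html, @"\.NET").Count;
}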

But how much is it used? Using the query below:

SELECT Count(*) count
FROM
  [fh-bigquery:github_extracts.contents_net_cs]
WHERE
  REGEXP_MATCH(content, r'\sasync\s|\sawait\s')

I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
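To see what the pattern in that query does and does not catch, here is the same regular expression applied in Python. This is a sketch for experimenting locally and assumes that REGEXP_MATCH succeeds on a match anywhere in the content, as `re.search` does; the function names are hypothetical.

```python
import re

# The same pattern as in the REGEXP_MATCH call above.
ASYNC_AWAIT = re.compile(r"\sasync\s|\sawait\s")

def uses_async_or_await(content):
    """True if 'async' or 'await' appears surrounded by whitespace."""
    return ASYNC_AWAIT.search(content) is not None

def count_files(files):
    """Number of files with at least one match."""
    return sum(1 for content in files if uses_async_or_await(content))
```

Because the pattern requires whitespace on both sides, an expression such as `(await Foo())` is missed, while the words `async` or `await` inside a comment are counted, so the figure is an approximation in both directions.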

Do C# Developers Like Using the var Keyword?

Less than they use async and await: there are only 130,590 files that have at least one usage of the var keyword.

Just How Many Files Should You Have in a Repository?

90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.

[Figure: Number of C# Files per Repository]

(again the Y-axis (# files) is logarithmic)

The top 10 largest repositories, by number of C# files are shown below:

| Repository | # Files |
|---|---|
| https://github.com/xen2/mcs | 23389 |
| https://github.com/mater06/LEGOChimaOnlineReloaded | 14241 |
| https://github.com/Microsoft/referencesource | 13051 |
| https://github.com/dotnet/corefx | 10652 |
| https://github.com/apo-j/Projects_Working | 10185 |
| https://github.com/Microsoft/CodeContracts | 9338 |
| https://github.com/drazenzadravec/nequeo | 8060 |
| https://github.com/ClearCanvas/ClearCanvas | 7946 |
| https://github.com/mwilliamson-firefly/aws-sdk-net | 7860 |
| https://github.com/151706061/MacroMedicalSystem | 7765 |

What is the Most Popular Repository with C# Code in it?

This time, we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):

| repo | stars | files |
|---|---|---|
| https://github.com/grpc/grpc | 11075 | 237 |
| https://github.com/dotnet/coreclr | 8576 | 6503 |
| https://github.com/dotnet/roslyn | 8422 | 6351 |
| https://github.com/facebook/yoga | 8046 | 73 |
| https://github.com/bazelbuild/bazel | 7123 | 132 |
| https://github.com/dotnet/corefx | 7115 | 10652 |
| https://github.com/SeleniumHQ/selenium | 7024 | 512 |
| https://github.com/Microsoft/WinObjC | 6184 | 81 |
| https://github.com/qianlifeng/Wox | 5674 | 207 |
| https://github.com/Wox-launcher/Wox | 5674 | 142 |
| https://github.com/ShareX/ShareX | 5336 | 766 |
| https://github.com/Microsoft/Windows-universal-samples | 5130 | 1501 |
| https://github.com/NancyFx/Nancy | 3701 | 957 |
| https://github.com/chocolatey/choco | 3432 | 248 |
| https://github.com/JamesNK/Newtonsoft.Json | 3340 | 650 |

Interesting that the top spot is a Google repository! (The C# files in it are sample code for using the gRPC library from .NET.)

What are the Most Popular C# Class Names?

Assuming that I got the regex correct, the most popular C# class names are the following:

| Class name | Count |
|---|---|
| class C | 182480 |
| class Program | 163462 |
| class Test | 50593 |
| class Settings | 40841 |
| class Resources | 39345 |
| class A | 34687 |
| class App | 28462 |
| class B | 24246 |
| class Startup | 18238 |
| class Foo | 15198 |

Yay for Foo, just sneaking into the Top 10!!
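Since the result depends on getting the regex right, here is one plausible version of it as a Python sketch. The pattern and the name `count_class_names` are assumptions and not necessarily what the original query used. It will also pick up the word after "class" in comments and strings (for example "this class is ..." would count `is`), which is the kind of edge case mentioned at the start of the post.

```python
import re
from collections import Counter

# The identifier that follows the keyword "class".
CLASS_DECL = re.compile(r"\bclass\s+([A-Za-z_]\w*)")

def count_class_names(files):
    """Count declared class names across a collection of file contents."""
    counts = Counter()
    for content in files:
        counts.update(CLASS_DECL.findall(content))
    return counts
```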

‘Foo.cs’, ‘Program.cs’ or Something Else, What’s the Most Common File Name?

Finally, let's look at the most common file names. As with the using statements, they are dominated by the default ones used in the Visual Studio templates:

| File | Count |
|---|---|
| AssemblyInfo.cs | 386822 |
| Program.cs | 105280 |
| Resources.Designer.cs | 40881 |
| Settings.Designer.cs | 35392 |
| App.xaml.cs | 21928 |
| Global.asax.cs | 16133 |
| Startup.cs | 14564 |
| HomeController.cs | 13574 |
| RouteConfig.cs | 11278 |
| MainWindow.xaml.cs | 11169 |
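Counting file names only needs the last segment of each repository path. A minimal Python sketch, assuming the paths use `/` separators as they do in the longest-files table earlier (the name `count_file_names` is hypothetical):

```python
import posixpath
from collections import Counter

def count_file_names(paths):
    """Count how often each file name (the last path segment) occurs."""
    # Repository paths use '/' separators, so posixpath behaves the same on any OS.
    return Counter(posixpath.basename(path) for path in paths)
```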

More Information

As always, if you’ve read this far, your present is yet more blog posts to read, enjoy!!

How BigQuery Works (Only Put In At the End of the Blog Post)

BigQuery Analysis of Other Programming Languages

The post Analyzing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!
