About a year ago, Google made all the open source code on GitHub available within BigQuery and as if that wasn’t enough, you can run a terabyte of queries each month for free!
So in this post, I am going to look at all the C# source code on GitHub and see what we can find out from it. Handily, a smaller, C#-only dataset has been made available (in BigQuery, you are charged per byte read, so a smaller table is cheaper to query). It’s called fh-bigquery:github_extracts.contents_net_cs and has:
- 5,885,933 unique ‘.cs’ files
- 792,166,632 lines of code (LOC)
- 37.17 GB (37,174,783,891 bytes) of data
So a pretty comprehensive set of C# source files!
The rest of this post will attempt to answer the following questions:
- Tabs or Spaces?
- regions: ‘should be banned’ or ‘okay in some cases’?
- ‘K&R’ or ‘Allman’, where do C# devs like to put their braces?
- Do C# developers like writing functional code?
Then moving onto some less controversial C# topics:
- Which using statements are most widely used?
- What NuGet packages are most often included in a .NET project
- How many lines of code (LOC) are in a typical C# file?
- What is the most widely thrown Exception?
- ‘async/await all the things’ or not?
- Do C# developers like using the var keyword?
Before finishing up by looking at repositories, not just individual C# files:
- What is the most popular repository with C# code in it?
- Just how many files should you have in a repository?
- What are the most popular C# class names?
- ‘Foo.cs’, ‘Program.cs’ or something else, what’s the most common file name?
If you want to try the queries for yourself (or find my mistakes), all of them are available in this gist. There’s a good chance that my regular expressions miss some edge cases; after all, Regular Expressions: Now You Have Two Problems!!
Tabs or Spaces?
In the entire data-set, there are 5,885,933 files, but here we only include ones that have more than 10 lines starting with a tab or a space:
Tabs | Tabs % | Spaces | Spaces % | Total |
---|
799,055 | 17.15% | 3,859,528 | 82.85% | 4,658,583 |
Clearly, C# developers (on GitHub) prefer Spaces over Tabs, let the endless debates continue!! (I think some of this can be explained by the fact that Visual Studio uses ‘spaces’ by default).
If you want to see how C# compares to other programming languages, take a look at 400,000 GitHub repositories, 1 billion files, 14 terabytes of code: Spaces or Tabs?
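To make the counting rule concrete, here is a minimal local sketch in Python. It is not the BigQuery query from the gist: the function name `indentation_style` and the "majority wins" tie-break are my own assumptions. Only the "more than 10 indented lines" threshold comes from the description above.

```python
from collections import Counter
from typing import Optional

# From the post: only files with more than 10 lines starting with a tab or a space count.
MIN_INDENTED_LINES = 10


def indentation_style(source: str) -> Optional[str]:
    """Return 'tabs', 'spaces', or None when the file has too few indented lines."""
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    if counts["tabs"] + counts["spaces"] <= MIN_INDENTED_LINES:
        return None  # excluded from the totals
    # Assumption: the more frequent style wins, and ties go to spaces.
    return "tabs" if counts["tabs"] > counts["spaces"] else "spaces"
```

Running this over every file and tallying the results would give a table of the same shape as the one above, although the exact numbers depend on how mixed-indentation files are treated.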
regions: ‘should be banned’ or ‘okay in some cases’?
It turns out that there are an impressive 712,498 C# files (out of 5.9 million) that contain at least one #region directive (query used), that’s just over 12%. (I’m hoping that a lot of those files have been auto-generated by a tool!)
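As a rough illustration of the check, the Python sketch below flags a file when `#region` is the first non-whitespace token on a line. The pattern and the helper name `has_region` are assumptions of mine, so they may not match the regex in the gist. The last line reproduces the "just over 12%" figure from the two counts quoted above.

```python
import re

# A directive must be the first non-whitespace token on its line.
# C# also allows whitespace between '#' and the directive name.
REGION_DIRECTIVE = re.compile(r"^[ \t]*#[ \t]*region\b", re.MULTILINE)


def has_region(source: str) -> bool:
    """True when the file contains at least one #region directive."""
    return REGION_DIRECTIVE.search(source) is not None


# Share of files with a #region, using the counts quoted in the post.
region_share = round(100 * 712_498 / 5_885_933, 1)
```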
‘K&R’ or ‘Allman’, Where Do C# devs Like to Put their Braces?
C# developers overwhelmingly prefer putting an opening brace { on its own line (query used):
separate line | same line | same line (initializer) | total (with brace) | total (all code) |
---|
81,306,320 (67%) | 40,044,603 (33%) | 3,631,947 (2.99%) | 121,350,923 (15.32%) | 792,166,632 |
(‘same line initializers’ include code like new { Name = "", .. } and new [] { 1, 2, 3.. })
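One way to approximate this classification line by line is sketched below in Python. The category labels mirror the table headers, but the `INITIALIZER` pattern and both function names are hypothetical. In particular, I am assuming an initializer is any line where `new` is followed by `{` before a semicolon, which is cruder than whatever the original query does.

```python
import re
from collections import Counter
from typing import Optional

# Assumption: 'new ... {' on a single line marks an object/array initializer.
INITIALIZER = re.compile(r"\bnew\b[^{;]*\{")


def brace_style(line: str) -> Optional[str]:
    """Classify one line by where its opening brace sits, or None if it has none."""
    stripped = line.strip()
    if "{" not in stripped:
        return None
    if stripped == "{":
        return "separate line"
    if INITIALIZER.search(stripped):
        return "same line (initializer)"
    return "same line"


def count_brace_styles(source: str) -> Counter:
    """Tally brace styles over every line of a file."""
    styles = (brace_style(line) for line in source.splitlines())
    return Counter(style for style in styles if style)
```

Braces inside strings and comments are counted too, so treat the output as an estimate.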
Do C# Developers Like Writing Functional Code?
This is slightly unscientific, but I wanted to see how widely the Lambda Operator => is used in C# code (query). Yes, I know, if you want to write functional code on .NET, you really should use F#, but C# has become more ‘functional’ over the years and I wanted to see how much code was taking advantage of that.
Here are the raw percentiles:
Percentile | % of lines using lambdas |
10 | 0.51 |
25 | 1.14 |
50 | 2.50 |
75 | 5.26 |
90 | 9.95 |
95 | 14.29 |
99 | 28.00 |
So we can say that:
- 50% of all the C# files on GitHub use => on 2.50% (or less) of their lines
- 10% of all C# files have lambdas on almost 1 in 10 of their lines
- 5% use => on 1 in 7 lines (14.29%) or more
- 1% of files have lambdas on over 1 in 4 (28%) of their lines of code, that’s pretty impressive!
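For anyone wanting to reproduce figures like these on a local checkout, here is a small Python sketch of the two steps: a per-file percentage of lines containing `=>`, then a percentile over those percentages. Both helpers are hypothetical. I am assuming every line counts (blank ones included) and a nearest-rank percentile, which may differ from how BigQuery computes its quantiles.

```python
import math
from typing import Sequence


def lambda_line_percentage(source: str) -> float:
    """Percentage of lines in a file that contain the => operator."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    hits = sum("=>" in line for line in lines)
    return 100.0 * hits / len(lines)


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Note that a plain substring test also counts expression-bodied members and any `=>` inside comments or strings, which is part of why this is "slightly unscientific".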
Which using Statements are Most Widely Used?
Now on to something a bit more substantial: what are the most widely used using statements in C# code?
The top 10 look like this (the full results are available):
using statement | count |
using System.Collections.Generic; | 1,780,646 |
using System; | 1,477,019 |
using System.Linq; | 1,319,830 |
using System.Text; | 902,165 |
using System.Threading.Tasks; | 628,195 |
using System.Runtime.InteropServices; | 431,867 |
using System.IO; | 407,848 |
using System.Runtime.CompilerServices; | 338,686 |
using System.Collections; | 289,867 |
using System.Reflection; | 218,369 |
However, as was pointed out, the top 5 are included by default when you add a new file in Visual Studio and many people wouldn’t remove them. The same applies to ‘System.Runtime.InteropServices’ and ‘System.Runtime.CompilerServices’, which are included in ‘AssemblyInfo.cs’ by default.
So if we adjust the list to take account of this, the top 10 looks like:
using statement | count |
using System.IO; | 407,848 |
using System.Collections; | 289,867 |
using System.Reflection; | 218,369 |
using System.Diagnostics; | 201,341 |
using System.Threading; | 179,168 |
using System.ComponentModel; | 160,681 |
using System.Web; | 160,323 |
using System.Windows.Forms; | 137,003 |
using System.Globalization; | 132,113 |
using System.Drawing; | 127,033 |
Finally, an interesting list is the top 10 using statements that aren’t System, Microsoft or Windows namespaces:
using statement | count |
using NUnit.Framework; | 119,463 |
using UnityEngine; | 117,673 |
using Xunit; | 99,099 |
using System.Web.Mvc; | 87,622 |
using Newtonsoft.Json; | 81,675 |
using Newtonsoft.Json.Linq; | 29,416 |
using Moq; | 23,546 |
using UnityEngine.UI; | 20,355 |
using UnityEditor; | 19,937 |
using Amazon.Runtime; | 18,941 |
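The sketch below shows one way such counts could be gathered locally in Python. The regex, the function names and the choice to count each namespace once per file are all my assumptions, not the query from the gist. The pattern deliberately skips `using (...)` blocks and `using X = Y;` aliases.

```python
import re
from collections import Counter
from typing import Iterable

# Matches 'using Some.Namespace;' (optionally 'using static ...;') at the start of a line.
USING_DIRECTIVE = re.compile(
    r"^[ \t]*using[ \t]+(?:static[ \t]+)?([A-Za-z_][\w.]*)[ \t]*;", re.MULTILINE
)


def count_usings(files: Iterable[str]) -> Counter:
    """Count how many files import each namespace (once per file)."""
    counts = Counter()
    for source in files:
        counts.update(set(USING_DIRECTIVE.findall(source)))
    return counts


def third_party(counts: Counter, excluded=("System", "Microsoft", "Windows")) -> Counter:
    """Drop namespaces whose first segment is one of the excluded roots."""
    return Counter(
        {ns: n for ns, n in counts.items() if ns.split(".")[0] not in excluded}
    )
```

Calling `count_usings(files).most_common(10)` gives a top 10 in the same shape as the tables above.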
What NuGet Packages are Most Often Included in a .NET Project?
It turns out that there is also a separate dataset containing all the ‘packages.config’ files on GitHub. It’s called contents_net_packages_config and has 104,808 entries. By querying this, we can see that Json.Net is the clear winner!!
package | count |
Newtonsoft.Json | 45,055 |
Microsoft.Web.Infrastructure | 16,022 |
Microsoft.AspNet.Razor | 15,109 |
Microsoft.AspNet.WebPages | 14,495 |
Microsoft.AspNet.Mvc | 14,236 |
EntityFramework | 14,191 |
Microsoft.AspNet.WebApi.Client | 13,480 |
Microsoft.AspNet.WebApi.Core | 12,210 |
Microsoft.Net.Http | 11,625 |
jQuery | 10,646 |
Microsoft.Bcl.Build | 10,641 |
Microsoft.Bcl | 10,349 |
NUnit | 10,341 |
Owin | 9,681 |
Microsoft.Owin | 9,202 |
Microsoft.AspNet.WebApi.WebHost | 9,007 |
WebGrease | 8,743 |
Microsoft.AspNet.Web.Optimization | 8,721 |
Microsoft.AspNet.WebApi | 8,179 |
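A ‘packages.config’ file is a small XML document with one `<package id="..." version="..." />` element per dependency, so the package IDs can be pulled out with a standard XML parser. The Python sketch below is only an illustration: the function names are hypothetical, and I am assuming each package is counted once per file and that malformed files are skipped.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from typing import Iterable, List


def package_ids(packages_config_xml: str) -> List[str]:
    """Return the 'id' attribute of every <package> element, in document order."""
    root = ET.fromstring(packages_config_xml)
    return [pkg.get("id") for pkg in root.iter("package") if pkg.get("id")]


def count_packages(configs: Iterable[str]) -> Counter:
    """Count how many packages.config files reference each package."""
    counts = Counter()
    for xml_text in configs:
        try:
            counts.update(set(package_ids(xml_text)))
        except ET.ParseError:
            continue  # Assumption: skip files that are not well-formed XML.
    return counts
```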
How Many Lines of Code (LOC) are in a Typical C# File?
Are C# developers prone to creating huge files that go on for 1000s of lines? Well some are, but fortunately it’s the minority of us!!
Note the Y-axis is ‘lines of code’ and is logarithmic, the raw data is available.
Oh dear, Uncle Bob isn’t going to be happy: whilst 96% of the files have 509 LOC or less, the other 4% don’t!! From Clean Code:
And in case you’re wondering, here’s the Top 10 longest C# files!!
File | Lines |
MarMot/Input/test.marmot.cs | 92663 |
src/CodenameGenerator/WordRepos/LastNamesRepository.cs | 88810 |
cs_inputtest/cs_02_7000.cs | 63004 |
cs_inputtest/cs_02_6000.cs | 54004 |
src/ML NET20/Utility/UserName.cs | 52014 |
MWBS/Dictionary/DefaultWordDictionary.cs | 48912 |
Sources/Accord.Math/Matrix/Matrix.Comparisons1.Generated.cs | 48407 |
UrduProofReader/UrduLibs/Utils.cs | 48255 |
cs_inputtest/cs_02_5000.cs | 45004 |
css/style.cs | 44366 |
What is the Most Widely Thrown Exception?
There are a few interesting results in this query. For instance, who knew that so many ApplicationExceptions were thrown? And NotSupportedException being so high up the list is a bit worrying!!
Exception | count |
throw new ArgumentNullException | 699,526 |
throw new ArgumentException | 361,616 |
throw new NotImplementedException | 340,361 |
throw new InvalidOperationException | 260,792 |
throw new ArgumentOutOfRangeException | 160,640 |
throw new NotSupportedException | 110,019 |
throw new HttpResponseException | 74,498 |
throw new ValidationException | 35,615 |
throw new ObjectDisposedException | 31,129 |
throw new ApplicationException | 30,849 |
throw new UnauthorizedException | 21,133 |
throw new FormatException | 19,510 |
throw new SerializationException | 17,884 |
throw new IOException | 15,779 |
throw new IndexOutOfRangeException | 14,778 |
throw new NullReferenceException | 12,372 |
throw new InvalidDataException | 12,260 |
throw new ApiException | 11,660 |
throw new InvalidCastException | 10,510 |
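A count like this can be approximated with a single regex over the file contents. The Python sketch below is my own guess at the approach, not the query from the gist: the pattern, the function name and the decision to count every occurrence (rather than once per file) are assumptions. Namespace-qualified names are folded into the bare type name.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the type after 'throw new', allowing an optional namespace prefix.
THROW_NEW = re.compile(r"\bthrow\s+new\s+((?:\w+\.)*\w*Exception)\b")


def count_thrown(files: Iterable[str]) -> Counter:
    """Count every 'throw new XException' occurrence across the given files."""
    counts = Counter()
    for source in files:
        for name in THROW_NEW.findall(source):
            counts[name.rsplit(".", 1)[-1]] += 1  # strip any namespace prefix
    return counts
```

Exception types whose names do not end in `Exception` and bare `throw;` rethrows are not picked up.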
‘async/await All the Things’ or Not?
The addition of the async and await keywords to the C# language makes writing asynchronous code much easier:
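public async Task<int> GetDotNetCountAsync()
{
    // Assuming _httpClient is an HttpClient: it exposes GetStringAsync, not DownloadStringAsync.
    var html = await _httpClient.GetStringAsync("http://dotnetfoundation.org");
    // Escape the dot so the pattern matches the literal text ".NET".
    return Regex.Matches(html, @"\.NET").Count;
}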
But how much is it used? Using the query below:
SELECT Count(*) count
FROM
  [fh-bigquery:github_extracts.contents_net_cs]
WHERE
  REGEXP_MATCH(content, r'\sasync\s|\sawait\s')
I found that there are 218,643 files (out of 5,885,933) that have at least one usage of async or await in them.
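To get a feel for what that pattern does and does not catch, the same regex can be tried locally. The Python sketch below reuses the pattern from the query unchanged. The helper name is hypothetical, and Python’s `re` module is only an approximation of the engine BigQuery uses.

```python
import re

# Same pattern as the BigQuery query: the keyword must have whitespace on both sides.
ASYNC_AWAIT = re.compile(r"\sasync\s|\sawait\s")


def uses_async_or_await(content: str) -> bool:
    """True when the file content matches the query's pattern at least once."""
    return ASYNC_AWAIT.search(content) is not None
```

Because whitespace is required on both sides, `Task.Run(async () => Work());` on its own is missed (the keyword follows a parenthesis), while a comment such as `// we await the result` is counted. So the 218,643 figure is best read as an estimate.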
Do C# Developers Like Using the var Keyword?
Less than they use async and await: there are 130,590 files that have at least one usage of the var keyword.
Just How Many Files Should You Have in a Repository?
90% of the repositories (that have any C# files) have 95 files or less. 95% have 170 files or less and 99% have 535 files or less.
(again the Y-axis (# files) is logarithmic)
The top 10 largest repositories, by number of C# files are shown below:
Repository | # Files |
https://github.com/xen2/mcs | 23389 |
https://github.com/mater06/LEGOChimaOnlineReloaded | 14241 |
https://github.com/Microsoft/referencesource | 13051 |
https://github.com/dotnet/corefx | 10652 |
https://github.com/apo-j/Projects_Working | 10185 |
https://github.com/Microsoft/CodeContracts | 9338 |
https://github.com/drazenzadravec/nequeo | 8060 |
https://github.com/ClearCanvas/ClearCanvas | 7946 |
https://github.com/mwilliamson-firefly/aws-sdk-net | 7860 |
https://github.com/151706061/MacroMedicalSystem | 7765 |
What is the Most Popular Repository with C# Code in it?
This time, we are going to look at the most popular repositories (based on GitHub ‘stars’) that contain at least 50 C# files (query used):
repo | stars | files |
https://github.com/grpc/grpc | 11075 | 237 |
https://github.com/dotnet/coreclr | 8576 | 6503 |
https://github.com/dotnet/roslyn | 8422 | 6351 |
https://github.com/facebook/yoga | 8046 | 73 |
https://github.com/bazelbuild/bazel | 7123 | 132 |
https://github.com/dotnet/corefx | 7115 | 10652 |
https://github.com/SeleniumHQ/selenium | 7024 | 512 |
https://github.com/Microsoft/WinObjC | 6184 | 81 |
https://github.com/qianlifeng/Wox | 5674 | 207 |
https://github.com/Wox-launcher/Wox | 5674 | 142 |
https://github.com/ShareX/ShareX | 5336 | 766 |
https://github.com/Microsoft/Windows-universal-samples | 5130 | 1501 |
https://github.com/NancyFx/Nancy | 3701 | 957 |
https://github.com/chocolatey/choco | 3432 | 248 |
https://github.com/JamesNK/Newtonsoft.Json | 3340 | 650 |
Interesting that the top spot is a Google repository! (The C# files in it are sample code for using the gRPC library from .NET.)
What are the Most Popular C# class Names?
Assuming that I got the regex correct, the most popular C# class names are the following:
Class name | Count |
class C | 182480 |
class Program | 163462 |
class Test | 50593 |
class Settings | 40841 |
class Resources | 39345 |
class A | 34687 |
class App | 28462 |
class B | 24246 |
class Startup | 18238 |
class Foo | 15198 |
Yay for Foo, just sneaking into the Top 10!!
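Since the result depends so heavily on the regex, here is one plausible version as a Python sketch. It is not the pattern from the gist: the modifier list and both names are my own assumptions. It only accepts `class` when it is preceded by nothing but known modifiers, which avoids counting comments and `where T : class` constraints.

```python
import re
from collections import Counter
from typing import Iterable

# 'class Name' at the start of a line, preceded only by optional modifiers.
CLASS_DECLARATION = re.compile(
    r"^[ \t]*(?:(?:public|internal|private|protected|static|sealed|abstract|partial|unsafe|new)[ \t]+)*"
    r"class[ \t]+([A-Za-z_]\w*)",
    re.MULTILINE,
)


def count_class_names(files: Iterable[str]) -> Counter:
    """Count every declared class name across the given files."""
    counts = Counter()
    for source in files:
        counts.update(CLASS_DECLARATION.findall(source))
    return counts
```

Classes declared after an attribute on the same line, or inside a multi-line comment, are among the edge cases this would get wrong.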
‘Foo.cs’, ‘Program.cs’ or Something Else, What’s the Most Common File Name?
Finally, let's look at the different file names used. As with the using statements, they are dominated by the default ones used in the Visual Studio templates:
File | Count |
AssemblyInfo.cs | 386822 |
Program.cs | 105280 |
Resources.Designer.cs | 40881 |
Settings.Designer.cs | 35392 |
App.xaml.cs | 21928 |
Global.asax.cs | 16133 |
Startup.cs | 14564 |
HomeController.cs | 13574 |
RouteConfig.cs | 11278 |
MainWindow.xaml.cs | 11169 |
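Counting file names only needs the last segment of each path. A minimal Python sketch, assuming each file’s repository path is available as a forward-slash separated string (the function name is hypothetical):

```python
import posixpath
from collections import Counter
from typing import Iterable


def count_file_names(paths: Iterable[str]) -> Counter:
    """Count how often each file name (the last path segment) occurs."""
    return Counter(posixpath.basename(path) for path in paths)
```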
As always, if you’ve read this far, your present is yet more blog posts to read, enjoy!!
- How BigQuery Works (Only Put In At the End of the Blog Post)
- BigQuery Analysis of Other Programming Languages
The post Analyzing C# code on GitHub with BigQuery first appeared on my blog Performance is a Feature!