C# Source for SQL Formatting

Mitchell Vines

4.15/5 (5 votes)

27 Jan 2019CPOL5 min read

21.1K

976

Source code for SQL formatting (now with formatting options)--as opposed to a utility, web service or plug-in/add-on...

Download source files - 24.6 KB

Introduction

Source for a simple and generic SQL formatter class.

For example:

WITH Sales_CTE(SalesPersonID, SalesOrderID, SalesYear) 
AS(SELECT SalesPersonID, SalesOrderID, YEAR(OrderDate) AS SalesYear 
FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL) 
SELECT SalesPersonID, COUNT(SalesOrderID) AS TotalSales, SalesYear 
FROM Sales_CTE GROUP BY SalesYear, SalesPersonID ORDER BY SalesPersonID, SalesYear;

becomes:

with Sales_CTE(SalesPersonID, SalesOrderID, SalesYear) as 
   (
   select SalesPersonID,
      SalesOrderID,
      Year(OrderDate) as SalesYear 
   from Sales.SalesOrderHeader 
   where SalesPersonID is not NULL
   ) 
select SalesPersonID,
   Count(SalesOrderID) as TotalSales,
   SalesYear 
from Sales_CTE 
group by SalesYear, SalesPersonID 
order by SalesPersonID, SalesYear;

Background

I recently managed a group of data integration programmers writing a lot of sophisticated SQL. In order to make their code easier for me to understand--and in the hopes of moving the group towards more standardized formatting in general, I started looking for a SQL formatter. And there are plenty of them, except...

I wanted source we could include in the development of our own tools and utilities; that I could modify to our preferences; that was not dialect-specific--and, of course, it had to be free. I may have missed it but I couldn't find a satisfactory solution so I spent a few nights and weekends and came up with this.

I'm sharing this in the hope that others may find it useful, but also hoping to learn a bit. The problem turned out to be pretty thorny and left me with the feeling there may be a better (more formal, structured) way to tackle the problem that I'm missing.

Using the Code

Use of the class is very simple:

var fmtr = new SQL_Formatter.Formatter();
var formattedStatement = fmtr.Format("... unformatted statement..." [, "... options..."]);
if (fmtr.Success)
    ... use the result ...;
else
    throw new exception(fmtr.LastResult);

Instead of throwing parsing errors, a Boolean Success property is set to false and a LastResult property to an informational message.

Method

The Format method first invokes a private UnFormat method to collapse and add whitespace to the input so that it can be parsed as a set of space delimited elements. Each element is then extracted into the token variable and examined for its syntactic significance.

public string Format(string sql)

...

// Remove formatting
try { sql = UnFormat(sql); }
catch (Exception ex)
{ return Fail(ex.Message); }

// Parse the statement
while (stmtIndex < sql.Length)
{
  // Skip leading spaces
  while (stmtIndex < sql.Length && sql.Substring(stmtIndex, 1) == " ")
    stmtIndex++;

  // Grab the next token, space delimited
  if (sql.IndexOf(" ", stmtIndex) > -1)
    token = sql.Substring(stmtIndex, sql.IndexOf(" ", stmtIndex) - stmtIndex);
  else
    token = sql.Substring(stmtIndex, sql.Length - stmtIndex);
...

When the element identifies the beginning of a comment or a quoted literal, it's extended--all subsequent elements are simply added to it--until the end of that construct is found. It's then appended to the result and parsing continues with the next element.

Elements that introduce multi-word SQL keywords ("left", for example, in "left join") are saved in the previousToken variable and combined with subsequent elements until the keyword is complete.

Each element is then considered for its role in the formatting (increasing or decreasing indentation, inserting newlines) before being added to the formatted output. Unexpected elements, a "when" without a preceding "case", for example, cause the formatting to fail. The helper function Fail implements the error handling described above.

The level of Indentation is influenced by keywords, parenthesis, etc. and these are tracked separately in local variables: tabLevel, parenLevel, etc. The netParens function evaluates each element's effect on indentation due to parenthesis, and the Tabs function considers these variables to return the appropriate indentation and vertical whitespace as each element is added to the result. When the end of the statement is reached, any non-zero value in the variables represents invalid syntax in the input (unbalanced parenthesis, etc.) and the formatting fails.

The currentKeyword stack, indexed by tabLevel, tracks the nesting of SQL constructs, as this is reflected in indentation. CTEs and case statements also require special consideration.

In practice, there are simply many variations and subtleties to be considered to get the desired result. For example, having found a "select" element, it could be part of a CTE, beginning a new statement, subject to a T-SQL or PL/SQL condition, etc. Consider the example at the beginning of this text, and this portion of the code that handles "select":

...
    case "select":
      // Begin a select statement
      // Pushes occur below, see tabLevel
      if (cte == tabLevel)
      {
        // Keep together--prevent the default vertical whitespace
        token = Tabs(true) + token.ToLower();
        cte = -1;
      }
      else if (currentKeyword[tabLevel] == "")
        // New statement
        token = Tabs() + token.ToLower();
      else if (currentKeyword[tabLevel] == "if")
        // SQL conditional
        token = Tabs(true) + "\t" + token.ToLower();
      else if (!currentKeyword[tabLevel].In(new string[] _
      { "select", "insert", "insert into", "if" }))
        // Force vertical whitespace
        token = (result.gtr("") & _
        result.Right(4) != Str.Repeat(Str.NewLine, 2) ? Str.NewLine : "") + _
        Tabs(true, 1) + token.ToLower();
      else
        // Newline only
        token = Tabs(true) + token.ToLower();
...

The "with" keyword beginning the sample statement indicates that a CTE is being constructed and cte is set to reflect the current tabLevel--information that is required when the second "select" keyword (the reference) is to be formatted. Keywords inside the CTE definition are formatted as usual, and the first "select" is recognized as requiring leading indentation as it is preceded by an opening parenthesis:

...
  // Increase tab level -- select
  if (token.Equals("(select", Str.IgnoreCase))
  {
    tabLevel++;
    token = (result.Right(1) != "\t" ? Tabs(true) : "") + "(" + Str.NewLine + Tabs() + "select";
    currentKeyword.Add("select");
    currentParens = parenLevel;
  }
...

Handling the parens as part of the element found seems inelegant, but it actually works quite well--it's a natural consequence of the space-delimited-element strategy and facilitates distinguishing between parentheses which demarcate components of the statement from those which appear in a function call.

Formatting options are passed as an equals sign / semi-colon delimited string. The currently supported options and default values are:

LeadingCommas = false
LeadingJoins = true
RemoveComments = false

The first two options reflect common developer practices of formatting the SQL so that portions are easily commented out for debugging--as opposed to my original intention of making the SQL easier to read.

In time, I'd expect a Dialect option will prove necessary or helpful, but I haven't seen the need yet.

Debug

When debugging, the formatter prepends an informational header like this example:

/*
Formatted -- https://www.codeproject.com/Articles/1275027/Csharp-source-for-SQL-formatting 
Length:   273
Elapsed:  46 milliseconds
*/

Demo

The solution includes a very simple winforms Demo executable along with the Formatter class.

Points of Interest

Though intended to support multiple dialects, there is currently a very pronounced T-SQL bias.

The implementation of the formatter employs a number of routines from unrelated libraried code; I've pulled these fragments into an additional file: LIB.cs. In particular, the KVP.List class used to manage the formatting options provides extensive functionality based on key-value-pairs but is simply spoofed in LIB.cs to support the one method used in this formatting logic: GetBoolean.

Comments

Comments pose very particular and interesting challenges. This code makes no attempt to format them, but the trouble they cause makes me wonder if that's the right choice.

An interesting component of this problem is that formatting relevant to a comment often appears outside the delimiters which indicate that it's a comment being parsed--as illustrated by the newlines in this example below:

select * /* first comment */
from table
/* second comment */

Further, comments tend to be formatted for readability in the unformatted (or rather--"originally formatted") code. When you change the formatting, the comments appear randomly formatted and consequently distracting and much less helpful.

I've not found a good solution to this problem and currently leave it as a manual cleanup effort.

History

24^th January, 2019
- Initial release
27^th January, 2019
- Trued-up discrepancies in the source code
- Added support for formatting options

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)