Introduction
Source for a simple and generic SQL formatter class.
For example:
WITH Sales_CTE(SalesPersonID, SalesOrderID, SalesYear)
AS(SELECT SalesPersonID, SalesOrderID, YEAR(OrderDate) AS SalesYear
FROM Sales.SalesOrderHeader WHERE SalesPersonID IS NOT NULL)
SELECT SalesPersonID, COUNT(SalesOrderID) AS TotalSales, SalesYear
FROM Sales_CTE GROUP BY SalesYear, SalesPersonID ORDER BY SalesPersonID, SalesYear;
becomes:
with Sales_CTE(SalesPersonID, SalesOrderID, SalesYear) as
(
select SalesPersonID,
SalesOrderID,
Year(OrderDate) as SalesYear
from Sales.SalesOrderHeader
where SalesPersonID is not NULL
)
select SalesPersonID,
Count(SalesOrderID) as TotalSales,
SalesYear
from Sales_CTE
group by SalesYear, SalesPersonID
order by SalesPersonID, SalesYear;
Background
I recently managed a group of data integration programmers writing a lot of sophisticated SQL. In order to make their code easier for me to understand--and in the hopes of moving the group towards more standardized formatting in general, I started looking for a SQL formatter. And there are plenty of them, except...
I wanted source we could include in the development of our own tools and utilities; that I could modify to our preferences; that was not dialect-specific--and, of course, it had to be free. I may have missed it but I couldn't find a satisfactory solution so I spent a few nights and weekends and came up with this.
I'm sharing this in the hope that others may find it useful, but also hoping to learn a bit. The problem turned out to be pretty thorny and left me with the feeling there may be a better (more formal, structured) way to tackle the problem that I'm missing.
Using the Code
Use of the class is very simple:
var fmtr = new SQL_Formatter.Formatter();
var formattedStatement = fmtr.Format("... unformatted statement..." [, "... options..."]);
if (fmtr.Success)
... use the result ...;
else
throw new exception(fmtr.LastResult);
Instead of throwing parsing errors, a Boolean Success
property is set to false
and a LastResult
property to an informational message.
Method
The Format
method first invokes a private
UnFormat
method to collapse and add whitespace to the input so that it can be parsed as a set of space delimited elements. Each element is then extracted into the token
variable and examined for its syntactic significance.
public string Format(string sql)
...
try { sql = UnFormat(sql); }
catch (Exception ex)
{ return Fail(ex.Message); }
while (stmtIndex < sql.Length)
{
while (stmtIndex < sql.Length && sql.Substring(stmtIndex, 1) == " ")
stmtIndex++;
if (sql.IndexOf(" ", stmtIndex) > -1)
token = sql.Substring(stmtIndex, sql.IndexOf(" ", stmtIndex) - stmtIndex);
else
token = sql.Substring(stmtIndex, sql.Length - stmtIndex);
...
When the element identifies the beginning of a comment or a quoted literal, it's extended--all subsequent elements are simply added to it--until the end of that construct is found. It's then appended to the result and parsing continues with the next element.
Elements that introduce multi-word SQL keywords ("left
", for example, in "left join
") are saved in the previousToken
variable and combined with subsequent elements until the keyword is complete.
Each element is then considered for its role in the formatting (increasing or decreasing indentation, inserting newlines) before being added to the formatted output. Unexpected elements, a "when
" without a preceding "case
", for example, cause the formatting to fail. The helper function Fail
implements the error handling described above.
The level of Indentation is influenced by keywords, parenthesis, etc. and these are tracked separately in local variables: tabLevel
, parenLevel
, etc. The netParens
function evaluates each element's effect on indentation due to parenthesis, and the Tabs
function considers these variables to return the appropriate indentation and vertical whitespace as each element is added to the result. When the end of the statement is reached, any non-zero value in the variables represents invalid syntax in the input (unbalanced parenthesis, etc.) and the formatting fails.
The currentKeyword
stack, indexed by tabLevel
, tracks the nesting of SQL constructs, as this is reflected in indentation. CTEs and case
statements also require special consideration.
In practice, there are simply many variations and subtleties to be considered to get the desired result. For example, having found a "select
" element, it could be part of a CTE, beginning a new statement, subject to a T-SQL or PL/SQL condition, etc. Consider the example at the beginning of this text, and this portion of the code that handles "select
":
...
case "select":
if (cte == tabLevel)
{
token = Tabs(true) + token.ToLower();
cte = -1;
}
else if (currentKeyword[tabLevel] == "")
token = Tabs() + token.ToLower();
else if (currentKeyword[tabLevel] == "if")
token = Tabs(true) + "\t" + token.ToLower();
else if (!currentKeyword[tabLevel].In(new string[] _
{ "select", "insert", "insert into", "if" }))
token = (result.gtr("") & _
result.Right(4) != Str.Repeat(Str.NewLine, 2) ? Str.NewLine : "") + _
Tabs(true, 1) + token.ToLower();
else
token = Tabs(true) + token.ToLower();
...
The "with
" keyword beginning the sample statement indicates that a CTE is being constructed and cte
is set to reflect the current tabLevel
--information that is required when the second "select
" keyword (the reference) is to be formatted. Keywords inside the CTE definition are formatted as usual, and the first "select
" is recognized as requiring leading indentation as it is preceded by an opening parenthesis:
...
if (token.Equals("(select", Str.IgnoreCase))
{
tabLevel++;
token = (result.Right(1) != "\t" ? Tabs(true) : "") + "(" + Str.NewLine + Tabs() + "select";
currentKeyword.Add("select");
currentParens = parenLevel;
}
...
Handling the parens as part of the element found seems inelegant, but it actually works quite well--it's a natural consequence of the space-delimited-element strategy and facilitates distinguishing between parentheses which demarcate components of the statement from those which appear in a function call.
Formatting options are passed as an equals sign / semi-colon delimited string. The currently supported options and default values are:
LeadingCommas = false
LeadingJoins = true
RemoveComments = false
The first two options reflect common developer practices of formatting the SQL so that portions are easily commented out for debugging--as opposed to my original intention of making the SQL easier to read.
In time, I'd expect a Dialect
option will prove necessary or helpful, but I haven't seen the need yet.
Debug
When debugging, the formatter prepends an informational header like this example:
/*
Formatted -- https://www.codeproject.com/Articles/1275027/Csharp-source-for-SQL-formatting
Length: 273
Elapsed: 46 milliseconds
*/
Demo
The solution includes a very simple winforms Demo executable along with the Formatter
class.
Points of Interest
Though intended to support multiple dialects, there is currently a very pronounced T-SQL bias.
The implementation of the formatter employs a number of routines from unrelated libraried code; I've pulled these fragments into an additional file: LIB.cs. In particular, the KVP.List
class used to manage the formatting options provides extensive functionality based on key-value-pairs but is simply spoofed in LIB.cs to support the one method used in this formatting logic: GetBoolean
.
Comments
Comments pose very particular and interesting challenges. This code makes no attempt to format them, but the trouble they cause makes me wonder if that's the right choice.
An interesting component of this problem is that formatting relevant to a comment often appears outside the delimiters which indicate that it's a comment being parsed--as illustrated by the newlines in this example below:
select * /* first comment */
from table
/* second comment */
Further, comments tend to be formatted for readability in the unformatted (or rather--"originally formatted") code. When you change the formatting, the comments appear randomly formatted and consequently distracting and much less helpful.
I've not found a good solution to this problem and currently leave it as a manual cleanup effort.
History
- 24th January, 2019
- 27th January, 2019
- Trued-up discrepancies in the source code
- Added support for formatting options