Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / programming / regular-expression

Why I Targeted T-SQL: A Confession

2.67/5 (4 votes)
11 Nov 2021Public Domain5 min read 4.6K  
Silly SQL Tricks, Mistakes and Other Nonsense
This is about misadventures with regular expressions and parsing in SQL, why I did it, why you shouldn't, and some general ravings from a slightly bent codewitch.

Introduction

I worked with T-SQL a long time ago, back in the bad old days when some shops used stored procedures to access anything via a server in the "DMZ". That meant no direct table access in most cases, and all data was validated. The point was to limit access and damage in the case of say, a compromised webserver being used to issue commands to the database.

It was today years old when I learned there are better ways to solve that problem now, and the old way has become largely irrelevant. This is after I built Reggie which targets T-SQL as well as C#.

Still, I'm glad I went through the effort, and I intend to go through the same process with Norm of getting it to function in T-SQL code.

Humor me as I explain.

Curse SQL. I love SQL.

If you've never had to write a stored procedure, consider yourself blessed. They lack things you take for granted in other languages like efficient string manipulation or arrays, and they are just generally creaky and weird to code. The syntax is godawful, impossible to remember, and nearly as hard to explain to someone that has never used it before.

This is perfect.

The language is such a disaster to code in that if I can target it with my code generation tools, I should be able to target pretty much any other language. It's like targetting COBOL or something.

SQL
SET @ch1 = SUBSTRING(@value, @index, 1)
SET @ch = UNICODE(@ch1)
SET @tch = @ch - 0xd800
IF @tch < 0 SET @tch = @tch + 2147483648
IF @tch < 2048
BEGIN
    SET @ch = @ch * 1024
    SET @index = @index + 1
    IF @index >= @valueEnd RETURN -1
    SET @ch2 = SUBSTRING(@value, @index, 1)
    SET @ch = @ch + UNICODE(@ch2) - 0x35fdc00
END

This is the code to read a UTF-32 codepoint (@ch) off of an NTEXT or NVARCHAR character input (represented above by @value). It's horrible, and in large part, it's horrible because there are no unsigned integers, there are no bit shifts, and there are no direct (read efficient) ways to get a single character off of a string.

The SQL statement to navigate a state transition isn't any better:

SQL
SELECT @toState = [SqlTableMatcherStateTransition].[ToStateId] 
    FROM [SqlTableMatcherState] 
    INNER JOIN [SqlTableMatcherStateTransition] ON 
        [SqlTableMatcherState].[StateId]=[SqlTableMatcherStateTransition].[StateId] AND 
        [SqlTableMatcherState].[SymbolId]=[SqlTableMatcherStateTransition].[SymbolId] AND 
        [SqlTableMatcherStateTransition].[BlockEndId]=[SqlTableMatcherState].[BlockEndId] 
    WHERE [SqlTableMatcherState].[SymbolId] = @symbolId AND 
        [SqlTableMatcherState].[StateId] = @state AND 
        [SqlTableMatcherState].[BlockEndId] = @blockId AND 
        [SqlTableMatcherStateTransition].[SymbolId] = @symbolId AND 
        @ch BETWEEN [dbo].[SqlTableMatcherStateTransition].[Min] 
            AND [dbo].[SqlTableMatcherStateTransition].[Max]

In C#, it looks a bit more sane, even though the state tables are collapsed and simply folded into an integer array:

C#
acc = dfa[state++];
tlen = dfa[state++];
for (i = 0; i < tlen; ++i) {
    tto = dfa[state++];
    prlen = dfa[state++];
    for (j = 0; j < prlen; ++j) {
        pmin = dfa[state++];
        pmax = dfa[state++];
        if (ch < pmin) break;
        if (ch <= pmax) {
            // tto contains the index to
            // the valid destination state
            // here is where the result
            // of the SQL query lands
        }
    }
}

Here's where things get fun.

Multilanguage Targeting. Seriously.

The thing is that's all that above mess was rendered using the same "TableMatcher.template" template in Reggie. Here's the whole thing for perspective. You can see how intricate it is - there are a lot of calls in it, most of which invoke templates that are written for a specific target:

C#
dynamic a=Arguments;

a.Comment("Matches text based on a DFA table and block end DFA table");
a.MethodPrologue("None",true,"MatcherMatchReturn","TableMatch","TableMatcherMatchImplParams");
    a.MatcherCreateResultList();
    a.TableMatcherMatchDeclarations();
    a.ReadCodePoint(false);
    a.InputLoopPrologue();
        a.MatcherResetMatch();
        a.TableMachineLoopPrologue();
            a.TableMove(false,false,true);
        a.TableMachineLoopEpilogue();
        a.TableAcceptPrologue();
            a.TableCheckerMatcherGetBlockEnd();
            a.TableIfBlockEndPrologue();
                a.TableStateReset();
                a.InputLoopPrologue();
                    a.TableMachineLoopPrologue();
                        a.TableMove(true,false,true);
                    a.TableMachineLoopEpilogue();
                    a.TableAcceptPrologue();
                        a.MatcherYieldResult();
                        a.BreakInputLoop();
                    a.TableAcceptEpilogue();
                    a.TableRejectPrologue();
                        a.UpdateLineAny();
                        a.AppendCapture();
                        a.ReadCodepoint(false);
                        a.AdvanceCursor();
                    a.TableRejectEpilogue();
                    a.TableStateReset();
                a.InputLoopEpilogue();
                a.TableStateReset();
                a.ContinueInputLoop();
            a.TableIfBlockEndEpilogue();
            a.TableIfNotBlockEndPrologue();
                a.MatcherYieldNonEmptyResult();
            a.TableIfNotBlockEndEpilogue();
        a.TableAcceptEpilogue();
        a.UpdateLineAny();
        a.ReadCodepoint(false);
        a.AdvanceCursor();
        a.TableStateReset();
    a.InputLoopEpilogue();
    a.MatcherReturnResultList();
a.MethodEpilogue();

for(var i = 0;i<((string[])a._symbolTable).Length;++i) {
    var s = ((string[])a._symbolTable)[i];
    if(s!=null) {
        a._symbol = s; // usually needed for the documentation template
        a.MethodPrologue("MatcherMatchDocumentation",false,
                         "MatcherMatchReturn","Match"+s,"MatcherMatchParams");
            a.TableMatcherMatchImplForward(s,i);
        a.MethodEpilogue();
    }
}

Here's "TableMove" template for C# (invoked above, which I tried to put in bold):

ASP.NET
<%@param name="isBlockEnd" type="bool"%>
<%@param name="isChecker" type="bool"%>
<%@param name="isMatcher" type="bool"%><%
dynamic a = Arguments;
string array;
string labelName = isBlockEnd?"block_end":"dfa";
if(!(bool)a.lexer) {
    array = isBlockEnd?"blockEnd":"dfa";
} else {
    array = isBlockEnd?"blockEnd":"TokenizeDfaStateTable";
}
a.Label("start_"+labelName);
%>done = true;
acc = <%=array%>[state++];
tlen = <%=array%>[state++];
for (i = 0; i < tlen; ++i) {
    tto = <%=array%>[state++];
    prlen = <%=array%>[state++];
    for (j = 0; j < prlen; ++j) {
        pmin = <%=array%>[state++];
        pmax = <%=array%>[state++];
        if(ch < pmin) break;
        if (ch <= pmax) {
<%a._indent=(int)a._indent+3;
if(!isChecker) {
    a.UpdateLineAny();
    a.AppendCapture();
}
a.ReadCodepoint(isChecker);
if(!isChecker) {
    a.AdvanceCursor();
}
%>state = tto;
done = false;
<%
if(!isMatcher ) {
    a.SetMatched();
}
%>            goto start_<%=labelName%>;<%a._indent=(int)a._indent-3;%>
        }
    }
}

The template is ASP/ASP.NET-like but it's my own tool that uses these files. Forgive the formatting, as the format of the code is somewhat dependent on the formatting of the output itself, and changing it alters the output.

Here's literally the same "table move" operation expressed in T-SQL:

ASP.NET
<%@param name="isBlockEnd" type="bool"%>
<%@param name="isChecker" type="bool"%>
<%@param name="isMatcher" type="bool"%><%
dynamic a = Arguments;
string labelName = isBlockEnd?"block_end":"dfa";
a.Label("start_"+labelName);
%>SET @done = 1
SET @toState = -1
<%if((bool)a.lexer) {
%>SELECT @toState = [<%=a.@class%>TokenizeStateTransition].[ToStateId] FROM [<%=a.@class%>TokenizeState] INNER JOIN [<%=a.@class%>TokenizeStateTransition] ON [<%=a.@class%>TokenizeState].[StateId]=[<%=a.@class%>TokenizeStateTransition].[StateId] WHERE [<%=a.@class%>TokenizeStateTransition].[BlockEndId]=@blockId AND [<%=a.@class%>TokenizeState].[StateId]=@state AND [<%=a.@class%>TokenizeState].[BlockEndId] = @blockId AND @ch BETWEEN [<%=a.@class%>TokenizeStateTransition].[Min] AND [<%=a.@class%>TokenizeStateTransition].[Max]<%
} else {
%>SELECT @toState = [<%=a.@class%>StateTransition].[ToStateId] FROM [<%=a.@class%>State] INNER JOIN [<%=a.@class%>StateTransition] ON [<%=a.@class%>State].[StateId]=[<%=a.@class%>StateTransition].[StateId] AND [<%=a.@class%>State].[SymbolId]=[<%=a.@class%>StateTransition].[SymbolId] AND [<%=a.@class%>StateTransition].[BlockEndId]=[<%=a.@class%>State].[BlockEndId] WHERE [<%=a.@class%>State].[SymbolId] = @symbolId AND [<%=a.@class%>State].[StateId] = @state AND [<%=a.@class%>State].[BlockEndId] = @blockId AND [<%=a.@class%>StateTransition].[SymbolId] = @symbolId AND @ch BETWEEN [<%=a.@class%>StateTransition].[Min] AND [<%=a.@class%>StateTransition].[Max]<%
}%>
IF @toState <> -1
BEGIN<%a._indent = ((int)a._indent) + 1;%>
SET @state = @toState
SET @done = 0<%
if(!isMatcher) {%>
SET @matched = 1
<%}
if(!isChecker) {
a.UpdateLineAny();
a.AppendCapture();
}
a.ReadCodepoint(isChecker);
if(!isChecker) {
a.AdvanceCursor();
%>GOTO start_<%=labelName%>
<%}
a._indent = ((int)a._indent) - 1;%>END

The long lines here are totally my fault. It's actually a lot more difficult to do multiline statements that are formatted properly using these templates, of which there are about 120 SQL ones. I got tired. Still, with these, we generated the code shown before (and more, as I omitted some of the code in the initial presentation above).

I didn't write just one big template for a matcher in T-SQL and one for C# for a good reason. The reason is I originally did do exactly that and it was failing tests because it was extremely difficult to get these two vary different programming environments to behave the same way with respect to the generated code. This way, I control the structure and flow of the code, and invoke the major operations the same way in every language - even SQL.

Whoops, My Mistake

Originally, I thought targetting T-SQL would be useful. That's probably because I've been out of that particular end of the business for well over a decade. Back then, we needed to do a lot of string manipulation from inside stored procs in certain situations to do things like update multiple rows because you didn't have access to the tables themselves.

It sounds like all of that can indeed stay in the middleware these days, where it belongs.

I wrote 120 templates to target SQL. It wasn't easy. It also wasn't wasted.

The Takeaway

I intend Norm and Reggie to eventually cover a myriad of languages and programming environments, like Python, various and sundry .NET languages, probably JS and C and/or C++, opening up simple parsing and efficient DFA based regex matching and tokenization** to multiple languages and platforms, allowing you to use the same input specification(s) across heterogenous environments with consistent behavior.

**(3 times faster than .NET regex vs. Reggie generated C# code) as well as tokenization which most regex engines don't do out of the box.

To do so, I need a very flexible set of templates to use to target each of these languages. By targeting SQL, which is the weirdest popular language family I could think of, I forced myself to refine these templates such that they can probably work with any of the languages I listed by copying an existing tree (say the one targeting C#) and then porting each individual file. Besides those languages, I can probably also target many others, probably including functional languages like Haskell.

Mistakes are either a total loss, or valuable depending on what you choose to take from them. I learned some things about separation of concerns with respect to modern databases and I also made my code far more flexible than it would have otherwise been because of this mistake. Far from a total loss, it actually put me ahead some, because I let it.

History

  • 11th November, 2021 - Initial submission

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication