
Open-Source SPL that Can Execute SQL without RDB

24 Jun 2022
SPL provides a syntax equivalent to the SQL92 standard and can perform rich and diverse data calculations. You can directly execute SQL by using TXT/CSV/JSON/XML/XLS/ Web Service/ MongoDB/ Salesforce… as data tables.
This article is about an open-source SPL that can execute SQL without RDB.

SQL syntax is close to natural language and has a low learning threshold. With the added bonus of first-mover advantage, it soon became popular among database manufacturers and users. After years of development, SQL has become the most widely used and most mature structured data computing language.

However, SQL depends on an RDB, and many scenarios have none: the data may come from CSV files, RESTful JSON, MongoDB, or other sources, or the task may mix several sources, such as CSV and XLS. In these scenarios, many people hard-code the algorithms in a high-level language such as Java or C#, which means writing lengthy low-level functions from scratch, with no guarantee of execution efficiency. It is easy to accumulate the "code sh*t mountain" that everyone hates. Others load the data into a database and then use SQL, but the loading process is cumbersome and the results are far from real-time. Sometimes an ETL tool has to be brought in as well, which makes the framework heavier, raises the risk, and makes mixed calculations doubly troublesome.
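
As a rough illustration of the hard-coding route, the following Java sketch filters CSV rows by an Amount column by hand. The class name, the sample rows, and the assumption of a simple comma-separated layout (header line first, no quoted fields) are hypothetical and only serve the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ManualCsvFilter {
    // Returns the data rows whose "Amount" column is >= minAmount.
    // Assumes a simple CSV: the first line is the header, no quoted or escaped commas.
    static List<String[]> filterByAmount(List<String> lines, double minAmount) {
        String[] header = lines.get(0).split(",");
        int amountIdx = Arrays.asList(header).indexOf("Amount");
        if (amountIdx < 0) {
            throw new IllegalArgumentException("no Amount column");
        }
        List<String[]> result = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            String[] fields = line.split(",");
            if (Double.parseDouble(fields[amountIdx]) >= minAmount) {
                result.add(fields);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for a real orders file
        List<String> lines = List.of(
            "OrderId,Client,Amount",
            "1,TAS,250.0",
            "2,KBRO,80.5",
            "3,PNS,100.0");
        for (String[] row : filterByAmount(lines, 100)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Even this single filter needs a header lookup, type parsing, and a loop. Joins, grouping, and sorting would each add more hand-written code.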

Now, here comes esProc SPL, and these problems can be easily solved.

SPL is an open-source computing technology, which fully covers the computing power of SQL and supports a wide variety of data sources. SQL can now be used for structured data computing without RDB.

Perfect SQL Computing Power

SPL provides a syntax equivalent to the SQL92 standard and can perform rich and diverse data calculations, including filtering, calculating fields, selecting some columns, renaming, etc. You can directly execute SQL by using files such as text and XLS as data tables. Let's take the CSV file as the data source as an example:

  1. Filtering

    Basic comparison operation:

    SQL
    $select * from d:/Orders.csv where Amount>=100

    like:

    SQL
    $select * from d:/Orders.csv where Client like '%bro%'

    Null value judgment:

    SQL
    $select * from d:/Orders.csv where Client is null

    Logical operators such as and, or and not can combine comparison operations to realize combined filtering:

    SQL
    $select * from d:/Orders.csv 
    where not Amount>=100 and Client like 'bro' or OrderDate is null

    in

    SQL
    $select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS') 

    Multi-layer parentheses:

    SQL
    $select * from d:/Orders.csv 
    where (OrderDate<date('2020-01-01') and Amount<=100) 
    or (OrderDate>=date('2020-12-31') and Amount>100)
  2. Calculating columns

    SPL has rich mathematical functions, string functions, and date functions:

    SQL
    $select round(Amount,2), price*quantity from d:/Orders.csv 
    $select left(Client,4) from d:/Orders.csv 
    $select year(OrderDate) from d:/Orders.csv

    case when

    SQL
    $select case year(OrderDate) 
     when 2021 then 'this year' 
     when 2020 then 'last year' 
     else 'previous years' end 
     from d:/Orders.csv 

    coalesce

    SQL
    $select coalesce(Client,'unknown') from d:/Orders.csv 
  3. SELECT
    SQL
    $select OrderId, Amount, OrderDate from d:/Orders.csv 
  4. ORDER BY
    SQL
    $select * from d:/Orders.csv order by Client, Amount desc 
  5. DISTINCT
    SQL
    $select distinct Client ,Sellerid from d:/Orders.csv 
  6. GROUP BY … HAVING
    SQL
    $select year(OrderDate),Client ,sum(Amount),count(1) 
    from d:/Orders.csv group by year(OrderDate),Client having sum(Amount)<=100 

    Aggregation functions include sum, count, avg, max, and min. Aggregation can be directly done without grouping:

    SQL
    $select avg(Amount) from d:/Orders.csv 
  7. JOIN

    Left join:

    SQL
    $select o.OrderId,o.Client,e.Name,e.Dept,e.EId 
    from d:/Orders.txt o left join d:/Employees.txt e on o.SellerId=e.Eid 

    Right join:

    SQL
    $select o.OrderId,o.Client,e.Name,e.Dept,e.EId 
    from d:/Employees.txt e right join d:/Orders.txt o on o.SellerId=e.Eid 

    Full join:

    SQL
    $select o.OrderId,o.Client,e.Name,e.Dept,e.EId 
    from d:/Employees.txt e full join d:/Orders.txt o on o.SellerId=e.EId

    Inner join:

    SQL
    $select o.OrderId,o.Client,e.Name,e.Dept 
    from d:/Orders.csv o inner join d:/Employees.csv e on o.SellerId=e.Eid 

    Inner join can also be written in the form of where:

    SQL
    $select o.OrderId,o.Client,e.Name,e.Dept 
    from d:/Orders.csv o ,d:/Employees.csv e where o.SellerId=e.Eid 
  8. Subquery
    SQL
    $select t.Client, t.s, ct.Name, ct.address from 
       (select Client ,sum(amount) s from d:/Orders.csv group by Client) t 
    left join ClientTable ct on t.Client=ct.Client 

    with:

    SQL
    $with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
      select t.Client, t.s, ct.Name, ct.address from t
     left join ClientTable ct on t.Client=ct.Client 

    Subquery within in:

    SQL
    $select * from d:/Orders.txt o where o.sellerid in (select eid from d:/Employees.txt) 
  9. AS

    Use the 'as' keyword to rename fields, calculated columns, physical tables, and subqueries:

    SQL
    $select price*quantity as subtotal from d:/detail.csv
  10. Set operations

    These include union, union all, intersect, and minus. Here is an example:

    SQL
    $select * from Orders1.csv union all select * from Orders2.csv
  11. into

    The query results can be written to the file with the keyword 'into':

    SQL
    $select dept,count(1) c,sum(salary) s into deptResult.xlsx 
    from employee.txt group by dept having s>100000

Rich Data Sources Support

SPL supports various non-database data sources, including text in various non-standard formats. CSV has been shown in the previous examples. Tab-separated TXT is also supported, and SPL handles it automatically according to the file extension:

SQL
$select * from d:/Orders.txt where Amount>=100 and Client like 'bro' or OrderDate is null 

If the separator is not a comma or tab, you need to use the SPL extension function. For example, the separator is a colon:

SQL
$select * from {file("d:/Orders.txt").import@t (;":")} 
where Amount>=100 and Client like 'bro' or OrderDate is null 

For files without title lines, column names can be represented by serial numbers:

SQL
$select * from {file("d:/Orders.txt").import()} where _4>=100 and _2 like 'bro' or _5 is null

Some strings in special formats should also be parsed with extension functions. For example, the date format is not standard yyyy-MM-dd:

SQL
$select year(OrderDate),sum(Amount) from 
{file("d:/Orders.txt").import@t(orderid,client,sellerid,amount,orderdate:date:"dd-MM-yyyy")}
group by year(OrderDate) 
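
The format string "dd-MM-yyyy" in the example above reads day first, then month, then year. The Java sketch below shows the same idea with the standard `java.time` API, whose pattern letters look the same. Whether SPL follows exactly the same pattern rules is an assumption here, and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DayFirstDate {
    // Pattern letters: dd = day of month, MM = month, yyyy = year
    private static final DateTimeFormatter DAY_FIRST =
        DateTimeFormatter.ofPattern("dd-MM-yyyy");

    // Parses a day-first date string and returns its year,
    // comparable to year(OrderDate) in the query above.
    static int yearOf(String text) {
        return LocalDate.parse(text, DAY_FIRST).getYear();
    }

    public static void main(String[] args) {
        System.out.println(yearOf("24-06-2022")); // prints 2022
    }
}
```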

SQL can also be executed on Excel files. For Excel with a standard format, you only need to directly reference the file name:

SQL
$select * from d:/Orders.xlsx where Amount>=100 and Client like 'bro' or OrderDate is null 

You can also read the specified sheet:

SQL
$select * from {file("D:/Orders.xlsx").xlsimport@t (;"sheet3")} 
where Amount>=100 and Client like 'bro' or OrderDate is null

CSV / XLS file downloaded from the remote website:

SQL
$select * from {httpfile("http://127.0.0.1:6868/Orders.csv").import@tc()} 
where Amount>=100 and Client like 'bro' or OrderDate is null

HTTP protocol has many features, such as character set, port number, post parameter, header parameter, login authentication, etc. SPL extension functions can support all of them. The extension function can also grab the table data on the web page and support downloading files from the FTP server, which will not be elaborated on here.

The JSON file will be read as a string before parsing:

SQL
$select * from {json(file("d:\\data.json").read())} 
where Amount>=100 and Client like 'bro' or OrderDate is null 

Two-dimensional JSON is rare; multi-layer JSON is the norm. The SPL extension functions can convert multi-layer data into two-dimensional records, which can then be calculated with SQL. The details will not be explained here.

Restful JSON

SQL
$select * from {json(httpfile("http://127.0.0.1:6868/api/getData").read())} 
where Amount>=100 and Client like 'bro' or OrderDate is null

If the extension functions are numerous or long, they can be written step by step:

  A
1 =httpfile("http://127.0.0.1:6868/api/getData")
2 =A1.read()
3 =json(A2)
4 $select * from {A3} where Amount>=100 and Client like 'bro' or OrderDate is null

Similar to CSV / XLS, SPL can also read JSON / XML files on HTTP websites.

XML

SQL
$select * from {xml(file("d:/data.xml").read(),"xml/row")} 
where Amount>=100 and Client like 'bro' or OrderDate is null

Web Service

SQL
$select * from {ws_call(ws_client("http://.../entityWS.asmx?wsdl"),
 "entityWS":"entityWSSoap":"getData")} 
 where Amount>=100 and Client like 'bro' or OrderDate is null

SPL can also support NoSQL.

MongoDB

SQL
$select * from {mongo_shell@x
 (mongo_open("mongodb://127.0.0.1:27017/mongo"),"main.find()")} 
 where Amount>=100 and Client like 'bro' or OrderDate is null

Data in MongoDB is often multi-layered, as is data from RESTful services and web services. All of it can be converted into two-dimensional data with SPL extension functions.

Salesforce

SQL
$select * from {sf_query(sf_open(),"/services/data/v51.0/query",
 "Select Id,CaseNumber,Subject From Case where Status='New'")} 
 where Amount>=100 and Client like 'bro' or OrderDate is null

Hadoop HDFS csv/xls/json/xml

  A
1 =hdfs_open(;"hdfs://192.168.0.8:9000")
2 =hdfs_file(A1,"/user/Orders.csv":"GBK")
3 =A2.import@t()
4 =hdfs_close(A1)
5 $select Client,sum(Amount) from {A3} group by Client

HBase

  A
1 =hbase_open("hdfs://192.168.0.8", "192.168.0.8")
2 =hbase_scan(A1,"Orders")
3 =hbase_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

HBase also has access methods such as filter and CMP, which can be supported by SPL.

Hive has a public JDBC interface, but its performance is poor. SPL provides a high-performance interface:

  A
1 =hive_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","hive","asus")
2 =hive_query(A1, "select * from table")
3 =hive_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

Spark

  A
1 =spark_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","aa")
2 =spark_query(A1,"select * from tablename")
3 =spark_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

Alibaba cloud

  A
1 =ali_open("http://test.ots.aliyuncs.com","LTAIXZNG5zzSPHTQ","sa","test")
2 =ali_query@x(A1,"test",["id1","id2"],[1,"10001"]:[10,"70001"], ["id1","id2","f1","f2"],f1>=2000.0)
3 $select Client,sum(Amount) from {A2} group by Client

Cassandra

  A
1 =stax_open("127.0.0.1":9042,"mycasdb","cassandra":"cassandra")
2 =stax_query(A1,"select * from user where id=?",1)
3 =stax_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

ElasticSearch

  A
1 =es_open("localhost:9200","user":"un1234")
2 =es_get(A1,"/person/_mget","{\"ids\":[\"1\",\"2\",\"5\"]}")
3 =es_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

Redis

  A
1 =redis_open()
2 =redis_hscan(A1, "runoobkey", "v*", 3)
3 =redis_close (A1)
4 $select key,value from {A2} where value>=2000 and value<3000

SAP BW

  A
1 =sap_open("userName","passWord","192.168.0.188","00","000","E")
2 =sap_cursor(A1, "Z_TEST1","IT_ROOM").fetch()
3 =sap_close(A1)
4 $select * from {A2} where Vendor like '%software%'

InfluxDB

  A
1 =influx_open("http://127.0.0.1:8086", "mydb", "autogen", "admin", "admin")
2 =influx_query(A1, "SELECT * FROM Orders")
3 =influx_close(A1)
4 $select Client,sum(Amount) from {A2} group by Client

Kafka

  A
1 =kafka_open("D://kafka.properties";"topic-test")
2 =kafka_poll(A1)
3 =kafka_close (A1)
4 $select Client,sum(Amount) from {A2} group by Client

MDX multidimensional database

  A
1 =olap_open("http://192.168.0.178:8088/msmdpump.dll","CubeTest","Administrator","admin")
2 =olap_query(A1,"with member [Measures].[AnnualInterestRate] as '[Measures].[SalesAmount]/[Measures].[StandardCost]-1' select {[Measures].[SalesAmount],[Measures].[StandardCost],[Measures].[AnnualInterestRate]} on columns, {[Order Date].[Calendar Year].[Calendar Year]} on rows from [DataSourceMulti]")

3 =olap_close(A1)
4 $select * from {A2} where SalesAmount>10000

In addition to supporting a wide variety of data sources, SPL can also perform mixed calculations between data sources. For example, between CSV and XLS:

SQL
$select o.OrderId,o.Client,e.Name,e.Dept
from d:/Orders.csv o inner join d:/Employees.xls e on o.SellerId=e.Eid

Between MongoDB and database:

  A B
1 =mongo_open("mongodb://127.0.0.1:27017/mongo")
2 =mongo_shell@x(A1,"detail.find()").fetch() =connect("orcl").query@x("select * from main")
3 $select d.title, m.path, sum(d.amount) from {A2} as d left join {B2} as m on d.cat=m.cat group by d.title, m.path

Mixed calculations can be performed between any data sources, and the SQL syntax is not affected by the data sources.
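
To show what a cross-source left join with grouping computes, here is a minimal Java sketch over in-memory data. It assumes both sources have already been loaded: one as a list of detail rows, the other as a lookup from category to path. The class names, field names, and sample values are hypothetical, and the sketch says nothing about how SPL executes the join internally.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MixedSourceJoin {
    // One row from the first source (for example, a document store)
    static final class Detail {
        final String title;
        final String cat;
        final double amount;

        Detail(String title, String cat, double amount) {
            this.title = title;
            this.cat = cat;
            this.amount = amount;
        }
    }

    // Left-joins details to catPaths on cat, then sums amount per (title, path).
    // The result key is "title|path"; path shows as "null" when no match exists.
    static Map<String, Double> sumByTitleAndPath(List<Detail> details,
                                                 Map<String, String> catPaths) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Detail d : details) {
            String path = catPaths.get(d.cat); // null when the right side has no match
            totals.merge(d.title + "|" + path, d.amount, Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Detail> details = List.of(
            new Detail("A", "c1", 10),
            new Detail("A", "c1", 5),
            new Detail("B", "c9", 7));
        Map<String, String> catPaths = Map.of("c1", "/x"); // second source
        System.out.println(sumByTitleAndPath(details, catPaths));
    }
}
```

Unmatched rows are kept with a null path, which is the behavior that distinguishes a left join from an inner join.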

Deeper Computing Power

In fact, SPL stands for Structured Process Language, a language designed specifically for structured data processing. The previous examples have already shown some of SPL's own syntax (the extension functions). SQL is just one feature that SPL provides, and SPL itself has more powerful and convenient computing capabilities than SQL. Some calculation logic is so complex that it is difficult to code in SQL or even in stored procedures, while SPL can complete the calculation with simpler code.

For example, here is a task: calculate the longest run of consecutive rising days for a stock. In SQL, this takes multi-layer nested subqueries and window functions, and the code is lengthy and difficult to understand:

SQL
select max(continuousDays)-1
from (select count(*) continuousDays
    from (select sum(changeSign) over(order by tradeDate) unRiseDays
        from (select tradeDate,
            case when price>lag(price) over(order by tradeDate)
            then 0 else 1 end changeSign
            from AAPL) )
        group by unRiseDays) 

While SPL only needs two lines:

  A B
1 =T("d:/AAPL.xlsx") Read Excel file, the first line is the title
2 =a=0,A1.max(a=if(price>price[-1],a+1,0)) Get the max continuous rising days
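
The logic of the expression `a=if(price>price[-1],a+1,0)` can be sketched in Java as a counter that grows while the price rises against the previous day and resets otherwise, with the answer being the largest value the counter reaches. The class name and sample prices are hypothetical. The sketch starts comparing from the second day; how SPL treats the first row, where `price[-1]` has no value, is not covered here.

```java
class RisingStreak {
    // Returns the length of the longest run of day-over-day price increases.
    // prices must be ordered by trade date.
    static int longestRisingRun(double[] prices) {
        int run = 0;  // current run of rising days
        int best = 0; // longest run seen so far
        for (int i = 1; i < prices.length; i++) {
            run = prices[i] > prices[i - 1] ? run + 1 : 0;
            best = Math.max(best, run);
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical closing prices, ordered by trade date
        double[] prices = {10, 11, 12, 11, 12, 13, 14, 13};
        System.out.println(longestRisingRun(prices)); // prints 3
    }
}
```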

For simple calculations, basic SQL is very convenient, but it becomes a poor fit when the calculation requirements get complex. Even with additional features such as window functions, the calculation does not become simpler. In such cases, we recommend using SPL directly, with its concise code, instead of writing complex multi-layer nested SQL. For this reason, the SQL in SPL supports only the SQL92 standard and does not provide later syntax such as window functions.

SQL does not encourage multi-step calculation. A task is usually written as one large statement, which makes it harder to write. SPL naturally supports multi-step calculation and can easily split a large, complex task into small, simple ones, which greatly reduces the difficulty of coding. For example, find the top n customers whose cumulative sales account for half of the total sales, and rank them by sales from largest to smallest:

  A B
1 = T("D:/data/sales.csv").sort(amount:-1) Fetch data, sort in descending order
2 =A1.cumulate(amount) Calculate cumulative sequence
3 =A2.m(-1)/2 The last cumulative value is the sum
4 =A2.pselect(~>=A3) The required position (more than half)
5 =A1(to(A4)) Get values by position
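
The five steps above can be sketched in Java as follows: sort in descending order, accumulate, take half of the total as the threshold, find the first position where the running total reaches it, and keep everything up to that position. The class name and sample amounts are hypothetical, and plain arrays stand in for the table.

```java
import java.util.Arrays;

class TopCustomers {
    // Returns the largest amounts, in descending order, up to and including the
    // first one at which the running total reaches half of the grand total.
    static double[] topHalf(double[] amounts) {
        double[] sorted = amounts.clone();
        Arrays.sort(sorted); // ascending
        for (int i = 0, j = sorted.length - 1; i < j; i++, j--) { // reverse to descending
            double tmp = sorted[i];
            sorted[i] = sorted[j];
            sorted[j] = tmp;
        }
        double total = 0;
        for (double a : sorted) {
            total += a;
        }
        double half = total / 2;
        double running = 0;
        int count = 0;
        while (count < sorted.length) {
            running += sorted[count];
            count++;
            if (running >= half) {
                break; // first position where the cumulative value reaches half
            }
        }
        return Arrays.copyOf(sorted, count);
    }

    public static void main(String[] args) {
        double[] amounts = {100, 400, 300, 200}; // hypothetical sales figures
        System.out.println(Arrays.toString(topHalf(amounts))); // prints [400.0, 300.0]
    }
}
```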

Flexible Application Structure

How to use SPL?

For interactive calculation and analysis, SPL has a professional IDE, which not only provides complete debugging functions but also lets you observe the intermediate results of each step visually.

SPL also supports command line execution and supports any mainstream operating system:

D:\raqsoft64\esProc\bin>esprocx.exe -R select Client,sum(Amount) 
from d:/Orders.csv group by Client
Log level:INFO
ARO     899.0
BDR     4278.8
BON     2564.4
BSF     14394.0
CHO     1174.0
CHOP    1420.0
DYD     1242.0
…

For the calculation in the application, SPL provides a standard JDBC driver and can be easily integrated into Java:

Java
…
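// Load the esProc JDBC driver and open a local connection
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// The leading $ marks the statement as SQL; ? are positional parameters
PreparedStatement st = conn.prepareStatement(
    "$select * from employee.txt where SALARY >=? and SALARY<?");
st.setObject(1, 3000);
st.setObject(2, 5000);
// executeQuery() returns the ResultSet; execute() only returns a boolean
ResultSet result = st.executeQuery();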
… 

Applications often involve frequent modifications or complex calculations. SPL allows the code to be placed outside the Java program, which can significantly reduce code coupling. For example, the above SPL code can be saved as a script file (named getQuery in this example) and then called from Java in the same way as a stored procedure:

Java
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn = DriverManager.getConnection("jdbc:esproc:local://");
// Call the external script by name, passing the two salary bounds
CallableStatement st = conn.prepareCall("{call getQuery(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result = st.executeQuery();
… 

With open-source SPL, you can easily use SQL without RDB.

History

  • 24th June, 2022: Initial version

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
