This article is about an open-source SPL that can execute SQL without RDB.
SQL syntax is close to natural language, with a low learning threshold and the bonus of first-mover advantage, it soon became popular between database manufacturers and users. After years of development, SQL has become the most widely used and most mature structured data computing language.
However, SQL must work based on RDB, and there is no RDB in many scenarios, such as encountering CSV \ restful JSON \ MongoDB and other data sources or performing mixed calculations between these data sources, such as CSV and XLS. In these scenarios, many people will choose to hard code algorithms in high-level languages such as Java or C#, etc., which requires writing lengthy underlying functions from scratch, and the execution efficiency is difficult to guarantee. It is easy to accumulate the "code sh*t mountain" that everyone hates. Some people load data into the database and then use SQL for calculation, but the loading process is very cumbersome and the real-time performance is also very poor. Sometimes, you have to turn to ETL tools. The framework is aggravated, the risk is increased, and it is doubly troublesome to do mixed calculations.
Now, here comes esProc SPL, and these problems can be easily solved.
SPL is an open-source computing technology, which fully covers the computing power of SQL and supports a wide variety of data sources. SQL can now be used for structured data computing without RDB.
Perfect SQL Computing Power
SPL provides a syntax equivalent to the SQL92 standard and can perform rich and diverse data calculations, including filtering, calculating fields, selecting some columns, renaming, etc. You can directly execute SQL by using files such as text and XLS as data tables. Let's take the CSV file as the data source as an example:
- Filtering
Basic comparison operation:
$select * from d:/Orders.csv where Amount>=100
like:
$select * from d:/Orders.csv where Client like '%bro%'
Null
value judgment:
$select * from d:/Orders.csv where Client is null
Logical operators such as and
, or
and not
can combine comparison operations to realize combined filtering:
$select * from d:/Orders.csv
where not Amount>=100 and Client like 'bro' or OrderDate is null
in
:
$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
Multi-layer parentheses:
$select * from d:/Orders.csv
where (OrderDate<date('2020-01-01') and Amount<=100)
or (OrderDate>=date('2020-12-31') and Amount>100)
- Calculating columns
SPL has rich mathematical functions, string functions, and date functions:
$select round(Amount,2), price*quantity from d:/Orders.csv
$select left(Client,4) from d:/Orders.csv
$select year(OrderDate) from d:/Orders.csv
case when
:
$select case year(OrderDate)
when 2021 then 'this year'
when 2020 then 'last year'
else 'previous years' end
from d:/Orders.csv
coalesce
:
$select coalesce(Client,'unknown') from d:/Orders.csv
SELECT
$select OrderId, Amount, OrderDate from d:/Orders.csv
ORDER BY
$select * from d:/Orders.csv order by Client, Amount desc
DISTINCT
$select distinct Client ,Sellerid from d:/Orders.csv
GROUP BY … HAVING
$select year(OrderDate),Client ,sum(Amount),count(1)
from d:/Orders.csv group by year(OrderDate),Client having sum(Amount)<=100
Aggregation functions include sum
, count
, avg
, max
, and min
. Aggregation can be directly done without grouping:
$select avg(Amount) from d:/Orders.csv
JOIN
Left join
:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId
from d:/Orders.txt o left join d:/Employees.txt e on o.SellerId=e.Eid
Right join
:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId
from d:/Employees.txt e right join d:/Orders.txt o on o.SellerId=e.Eid
Full join
:
$select o.OrderId,o.Client,e.Name e.Dept,e.EId
from d:/Employees.txt e full join d:/Orders.txt o on o.SellerId=e.EId
Inner join
:
$select o.OrderId,o.Client,e.Name e.Dept
from d:/Orders.csv o inner join d:/Employees.csv e on o.SellerId=e.Eid
Inner join
can also be written in the form of where
:
$select o.OrderId,o.Client,e.Name e.Dept
from d:/Orders.csv o ,d:/Employees.csv e where o.SellerId=e.Eid
- Subquery
$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t
left join ClientTable ct on t.Client=ct.Client
with:
$with t as (select Client ,sum(amount) s from d:/Orders.csv group by Client)
select t.Client, t.s, ct.Name, ct.address from t
left join ClientTable ct on t.Client=ct.Client
Subquery within in
:
$select * from d:/Orders.txt o where o.sellerid in (select eid from d:/Employees.txt)
AS
Use the 'as
' the keyword to rename fields, calculated columns, physical tables, and subqueries:
$select price*quantity as subtotal from d:/detail.csv
Set
operations
Including union
, union all
, intersect
, minus
. Here is an example:
$select * from Orders1.csv union all select * from Orders2.csv
into
The query results can be written to the file with the keyword 'into
':
$select dept,count(1) c,sum(salary) s into deptResult.xlsx
from employee.txt group by dept having s>100000
Rich Data Sources Support
SPL supports various non-database data sources, including text in various non-standard formats. CSV has been shown in the previous examples. Tab-separated TXT can also be supported, and SPL will process it automatically according to the extension:
$select * from d:/Orders.txt where Amount>=100 and Client like 'bro' or OrderDate is null
If the separator is not a comma or tab, you need to use the SPL extension function. For example, the separator is a colon:
$select * from {file("d:/Orders.txt").import@t (;":")}
where Amount>=100 and Client like 'bro' or OrderDate is null
For files without title lines, column names can be represented by serial numbers:
$select * from {file("d:/Orders.txt").import()} where _4>=100 and _2 like 'bro' or _5 is null
Some strings in special formats should also be parsed with extension functions. For example, the date format is not standard yyyy-MM-dd:
$select year(OrderDate),sum(Amount) from
{file("d:/Orders.txt").import@t(orderid,client,sellerid,amount,orderdate:date:"dd-MM-yyyy")}
group by year(OrderDate)
SQL can also be executed on Excel files. For Excel with a standard format, you only need to directly reference the file name:
$select * from d:/Orders.xlsx where Amount>=100 and Client like 'bro' or OrderDate is null
You can also read the specified sheet:
$select * from {file("D:/Orders.xlsx").xlsimport@t (;"sheet3")}
where Amount>=100 and Client like 'bro' or OrderDate is null
CSV / XLS file downloaded from the remote website:
$select * from {httpfile("http://127.0.0.1:6868/Orders.csv).import@tc() }
where Amount>=100 and Client like 'bro' or OrderDate is null
HTTP protocol has many features, such as character set, port number, post parameter, header parameter, login authentication, etc. SPL extension functions can support all of them. The extension function can also grab the table data on the web page and support downloading files from the FTP server, which will not be elaborated on here.
The JSON file will be read as a string before parsing:
$select * from {json(file("d:\\data.json").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
There are few two-dimensional JSON, and multi-layer is the norm. The SPL extension function can convert multi-layer data into two-dimensional records and then calculate them with SQL. The details will not be explained here.
Restful JSON
$select * from {json(httpfile("http://127.0.0.1:6868/api/getData").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
If there are many and long extension functions, they can be written in the step-by-step form:
| A |
1 | =httpfile("http://127.0.0.1:6868/api/getData") |
2 | =A1.read() |
3 | =json(A2) |
4 | $select * from {A3} where Amount>=100 and Client like 'bro' or OrderDate is null |
Similar to CSV / XLS, SPL can also read JSON / XML files on HTTP websites.
XML
$select * from {xml(file("d:/data.xml").read(),"xml/row")}
where Amount>=100 and Client like 'bro' or OrderDate is null
Web Service
$select * from {ws_call(ws_client("http://.../entityWS.asmx?wsdl"),
"entityWS ":" entityWSSoap":"getData")}
where Amount>=100 and Client like'bro' or OrderDate is null
SPL can also support NoSQL.
MongoDB
$select * from {mongo_shell@x
(mongo_open("mongodb://127.0.0.1:27017/mongo"),"main.find()")}
where Amount>=100 and Client like 'bro' or OrderDate is null
There are often multi-layer data in MongoDB, including restful and web service, and they all can be converted into two-dimensional data with SPL extension functions.
Salesforce
$select * from {sf_query(sf_open(),"/services/data/v51.0/query",
"Select Id,CaseNumber,Subject From Case where Status='New'")}
where Amount>=100 and Client like 'bro' or OrderDate is null
Hadoop HDFS csv/xls/json/xml:
| A |
1 | =hdfs_open(;"hdfs://192.168.0.8:9000") |
2 | =hdfs_file(A1,"/user/Orders.csv":"GBK") |
3 | =A2.import@t() |
4 | =hdfs_close(A1) |
5 | $select Client,sum(Amount) from {A3} group by Client |
HBase
| A |
1 | =hbase_open("hdfs://192.168.0.8", "192.168.0.8") |
2 | =hbase_scan(A1,"Orders") |
3 | =hbase_close(A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
HBase
also has access methods such as filter and CMP, which can be supported by SPL.
Hive has a public JDBC interface, but its performance is poor. SPL provides a high-performance interface:
| A |
1 | =hive_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","hive","asus") |
2 | =hive_query(A1, "select* fromtable") |
3 | =hive_close() |
4 | $select Client,sum(Amount) from {A2} group by Client |
Spark
| A |
1 | =spark_client("hdfs://192.168.0.8:9000","thrift://192.168.0.8:9083","aa") |
2 | =spark_query(A1,"select * from tablename") |
3 | =spark_close(A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
Alibaba cloud
| A |
1 | =ali_open("http://test.ots.aliyuncs.com","LTAIXZNG5zzSPHTQ","sa","test") |
2 | =ali_query@x(A1,"test",["id1","id2"],[1,"10001"]:[10,"70001"], ["id1","id2","f1","f2"],f1>=2000.0) |
3 | $select Client,sum(Amount) from {A2} group by Client |
Cassandra
| A |
1 | =stax_open("127.0.0.1":9042,"mycasdb","cassandra":"cassandra") |
2 | =stax_query(A1,"select * from user where id=?",1) |
3 | =stax_close(A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
ElasticSearch
| A |
1 | =es_open("localhost:9200","user":"un1234") |
2 | =es_get(A1,"/person/_mget","{\"ids\":[\"1\",\"2\",\"5\"]}") |
3 | =es_close(A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
Redis
| A |
1 | =redis_open() |
2 | =redis_hscan(A1, "runoobkey", "v*", 3) |
3 | =redis_close (A1) |
4 | $select key,value from {A2} where value>=2000 and value<3000 |
SAP BW
| A |
1 | =sap_open("userName","passWord","192.168.0.188","00","000",”E") |
2 | =sap_cursor(A1, "Z_TEST1","IT_ROOM").fetch() |
3 | =sap_close(A1) |
4 | $select * from {A2} where Vendor like '%software%' |
InfluxDB
| A |
1 | =influx_open("http://127.0.0.1:8086", "mydb", "autogen", "admin", "admin") |
2 | =influx_query(A1, "SELECT * FROM Orders") |
3 | =influx_close(A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
Kafka
| A |
1 | =kafka_open("D://kafka.properties";"topic-test") |
2 | =kafka_poll(A1) |
3 | =kafka_close (A1) |
4 | $select Client,sum(Amount) from {A2} group by Client |
MDX multidimensional database
| A |
1 | =olap_open("http://192.168.0.178:8088/msmdpump.dll", "CubeTest","Administrator","admin")
|
2 | =olap_query(A1,"with member [Measures].[AnnualInterestRate] as'[Measures].[SalesAmount]/[Measures].[StandardCost]-1'select {[Measures].[SalesAmount],[Measures].[StandardCost], [Measures].[AnnualInterestRate]} on columns,{[Order Date].[Calendar Year].[Calendar Year]} on rows from [DataSourceMulti]")
|
3 | =olap_close(A1) |
4 | $select * from {A2} where SalesAmount>10000 |
In addition to supporting a wide variety of data sources, SPL can also perform mixed calculations between data sources. For example, between CSV and RDB:
$select o.OrderId,o.Client,e.Name e.Dept
from d:/Orders.csv o inner join d:/Employees.xls e on o.SellerId=e.Eid
Between MongoDB and database:
| A | B |
1 | =mongo_open("mongodb://127.0.0.1:27017/mongo") |
2 | =mongo_shell@x(A1,"detail.find()").fetch() | =connect("orcl").query@x("select * from main") |
3 | $select d.title, m.path,sum(d.amount)from {A2} as d left join {B2} as m on d.cat=m.cat group by d.title, m.path
|
Mixed calculations can be performed between any data sources, and the SQL syntax is not affected by the data sources.
Deeper Computing Power
In fact, the original meaning of SPL is Structure Process Language, which is a language specially used for structured data processing. In the previous examples, some of the syntaxes of SPL itself (those extension functions) have been shown. SQL is just a function provided by SPL, and SPL itself has more powerful and convenient computing power than SQL. Some calculation logic is complex, and it is difficult to code in SQL or even stored procedures, while SPL can complete the calculation with simpler code.
For example, here is a task: calculate the longest consecutive rising days of stock. SQL uses multi-layer nested subqueries and window functions, and the code is lengthy and difficult to understand:
select max(continuousDays)-1
from (select count(*) continuousDays
from (select sum(changeSign) over(order by tradeDate) unRiseDays
from (select tradeDate,
case when price>lag(price) over(order by tradeDate)
then 0 else 1 end changeSign
from AAPL) )
group by unRiseDays)
While SPL only needs two lines:
| A | B |
1 | =T("d:/AAPL.xlsx") | Read Excel file, the first line is the title |
2 | =a=0,A1.max(a=if(price>price[-1],a+1,0)) | Get the max continuous rising days |
For simple calculations, using basic SQL is very convenient, but when the calculation requirements become complex, SQL is not applicable. Even if more functions (such as window functions) are provided, the calculation cannot be simplified. In this case, we recommend that users directly use SPL with concise code instead of writing multi-layer nested complex SQL. For this reason, SQL in SPL only supports the SQL92 standard and does not provide more syntax including window functions.
SQL does not advocate multi-step calculation. It is used to write a calculation task in a large statement, which will increase the difficulty of the task. SPL naturally supports multi-step calculation, and can easily split complex large calculation tasks into simple small tasks, which will greatly reduce the difficulty of coding. For example, find out the top n major customers whose cumulative sales account for half of the total sales, and rank them according to the sales from large to small:
| A | B |
1 | = T("D:/data/sales.csv").sort(amount:-1) | Fetch data, sort in descending order |
2 | =A1.cumulate(amount) | Calculate cumulative sequence |
3 | =A2.m(-1)/2 | The last cumulative value is the sum |
4 | =A2.pselect(~>=A3) | The required position (more than half) |
5 | =A1(to(A4)) | Get values by position |
Flexible Application Structure
How to use SPL?
For interactive calculation and analysis, SPL has a professional IDE, which not only has complete debugging functions, but also can visually observe the intermediate calculation results of each step:
SPL also supports command line execution and supports any mainstream operating system:
D:\raqsoft64\esProc\bin>esprocx.exe -R select Client,sum(Amount)
from d:/Orders.csv group by Client
Log level:INFO
ARO 899.0
BDR 4278.8
BON 2564.4
BSF 14394.0
CHO 1174.0
CHOP 1420.0
DYD 1242.0
…
For the calculation in the application, SPL provides a standard JDBC driver and can be easily integrated into Java:
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
PrepareStatement st = conn.prepareStatement
("$select * from employee.txt where SALARY >=? and SALARY<?");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
…
There will be frequent modifications or complex calculations in the application. SPL allows the code to be placed outside the Java program, which can significantly reduce the code coupling. For example, the above SPL code can be saved as a script file and then called in the form of a stored procedure in Java:
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call getQuery(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
…
With open-source SPL, you can easily use SQL without RDB.
Extended Reading
History
- 24th June, 2022: Initial version