Introduction
How to extract data from a table in SQL Server with XML fields which holds different XML namespaces into a one result set.
Background
I maintain a database which contains a master table with logs of user activities. This particular database is a part of a software that works as an integration bus between various third-party information systems. Every row within the log table contains two XML fields: an input object and an output object. The input object holds request parameters towards any third-party system; the output object holds the answer from the particular third-party system. The input object doesn't include any XML namespaces inside, but the output object does. How to extract all data from a table’s row itself and both XML fields?
Data extraction issue
The log table looks as follows:
CREATE TABLE [dbo].[EventLog](
[IdEvent] int NOT NULL PRIMARY KEY,
[EventName] nvarchar(255) NOT NULL,
[EventDateTime] datetime NOT NULL,
[InputObject] xml NOT NULL,
[OutputObject] xml NOT NULL,
[MethodName] nvarchar(255) NOT NULL,
[IpAddress] nvarchar(255) NOT NULL,
[Success] bit NOT NULL,
[ErrorText] nvarchar(max) NULL,
[SystemGuid] uniqueidentifier NOT NULL)
This particular table contains logs of dozens different methods, each one requires its own structure of XML. Let's take a closer look at the structure of a one method.
This is the example of an input object. Please, don't worry about a content of the patient
tag, it's just a logged search string.
<SearchTop10Patient>
<patient>Name=Smith Surname=John Birthday=27.08.1936 0:00:00
SecondName= Document_N= Document_S=</patient>
<idLpu>9</idLpu>
<guid>560FFA9C-A096-4FFB-8A73-0CB065EA7450</guid>
<idHistory />
</SearchTop10Patient>
This is the example of an output object. It's not a complete object, I had truncated it, because it’s too big, but that's enough for an example.
<SearchTop10PatientResult xmlns="http://schemas.datacontract.org/2004/07/HubService2"
xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
<ErrorList />
<IdHistory>430995</IdHistory>
<Success>true</Success>
<ListPatient>
<Patient>
<Birthday>1936-08-27T00:00:00</Birthday>
<Phone>4347129</Phone>
<IdPatient>222097</IdPatient>
</Patient>
<Patient>
<Birthday>1936-08-27T00:00:00</Birthday>
<Phone />
<IdPatient>1504943</IdPatient>
</Patient>
</ListPatient>
</SearchTop10PatientResult>
Now I'd like to extract the data from the table: [IdEvent]
, [EventDateTime]
, [idLpu]
, [IdPatient]
, [Birthday]
and [guid]
:
[IdEvent]
and [EventDateTime]
are contained in the table;[idLpu]
and [guid]
are parts of the input object (the first XML field);[IdPatient]
and [Birthday]
are parts of the output object (the second XML field).
Let’s try this simple code for the extracting of the necessary data:
;WITH XMLNAMESPACES(DEFAULT N'http://schemas.datacontract.org/2004/07/HubService2')
SELECT
el.[IdEvent],
el.[EventDateTime],
pats.value('./IdPatient[1]','int') AS [IdPatient],
pats.value('./Birthday[1]','date') AS [Birthday],
search.value('./idLpu[1]','nvarchar(20)') AS [idLpu],
search.value('./guid[1]','uniqueidentifier') AS [guid]
FROM [EventLog] AS el CROSS APPLY [OutputObject].nodes(
'/SearchTop10PatientResult/ListPatient/Patient') AS ST10Pout(pats)
CROSS APPLY [InputObject].nodes('/SearchTop10Patient') AS ST10Pin(search)
WHERE (el.[MethodName] = 'SearchTop10Patient')
AND(el.[EventDateTime] BETWEEN '05.02.2013' AND '06.02.2013')
This code returns nothing! The problem is that the WITH XMLNAMESPACES
construction applies to both XML fields as the default XML namespace: for the input object and for the output object. If I comment strings referring to one of the XML fields, it will work well, but the code will return only the data from a table row itself with one of the two XML fields data. How to solve this jigsaw?
Bad solution
The first solution that had arisen in my mind is using of temporary tables: the joinings of the table and each XML fields are extracted in two temporary tables correspondingly; then those temporary tables can be joined for the final data set. The joint data can be selected or even can be placed in the third temporary table... That solution sounds awful for me. Two or three temporary tables for a single data extraction is a wasting of server efficiency, the resulting code becomes more bulky also.
Better solution
I've found a better solution for my issue. I split up the data extraction code into two parts:
- the table function makes the joining of the table with the output XML object;
- the main code makes the joining of the table with the input XML object and join it with the table function.
So, the first part of the data extraction algorithm now looks as follows:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[fun_ExtractObject]
(
@CheckDate datetime
)
RETURNS TABLE
AS
RETURN
(
WITH XMLNAMESPACES(DEFAULT N'http://schemas.datacontract.org/2004/07/HubService2')
SELECT
elog.[IdEvent],
x.value('./IdPatient[1]','nvarchar(50)') AS [IdPatient],
x.value('./Birthday[1]','date') AS [Birthday]
FROM [dbo].[EventLog] AS elog CROSS APPLY [OutputObject].nodes(
'/SearchTop10PatientResult/ListPatient/Patient') AS e(x)
WHERE (elog.[MethodName] = 'SearchTop10Patient')
AND(elog.[EventDateTime] BETWEEN @CheckDate AND DATEADD(DAY,1,@CheckDate))
)
The second part of the algorithm looks as follows:
DECLARE @CheckDate datetime
SET @CheckDate = '05.02.2013'
SELECT
elog.[IdEvent],
elog.[EventDateTime],
x.value('./idLpu[1]','nvarchar(50)') AS [idLpu],
x.value('./guid[1]','uniqueidentifier') as [guid],
f_eo.[IdPatient],
f_eo.[Birthday]
FROM [dbo].[EventLog] AS elog CROSS APPLY [InputObject].nodes('/SearchTop10Patient') AS e(x)
INNER JOIN [dbo].[fun_ExtractObject](@CheckDate) AS f_eo ON elog.[IdEvent] = f_eo.[IdEvent]
WHERE (elog.[MethodName] = 'SearchTop10Patient')
AND(elog.[EventDateTime] BETWEEN @CheckDate AND DATEADD(DAY,1,@CheckDate))
The input parameter of the table function helps to minimize amount of data for joining, because this particular log table contains more than 100GB of data, and the volume is growing constantly. The [IdEvent]
field serves as a joining field for all pieces of data, because it's the primary key in the data table.
The outcome of the data extraction algorithm is a flat table that can be easily used in different reports.
Also, for the speed optimization I’m using a clustered index on the [IdEvent]
field and the following non-clustered index:
CREATE NONCLUSTERED INDEX [ind_EventLog_DtName] ON [dbo].[EventLog]
(
[EventDateTime] ASC,
[EventName] ASC
)
INCLUDE ([IdEvent]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
In case the [InputObject]
XML field contains its own XML namespace, this namespace should be declared before a SELECT statement by the WITH XMLNAMESPACES
construction. Moreover, if the WITH XMLNAMESPACES
construction is not the first statement in the script, it should contain a semicolon before it.
Probably the best solution
I suppose the best solution for my issue is a transforming either the input XML or the output XML field, so both of them will contain the same XML namespace or doesn't contain any. That sounds reasonably, but this approach will require some changes in the current code and also will require massive updates for one of the XML fields. All these improvements have to be done in the one moment.
Conclusion
Table function is a nice tool which can help to avoid the data extraction issue in case a data table contains XML fields with different XML namespaces. This approach makes resulting code looks slightly better for reading and maintaining.
Points of Interest
MSDN page about the WITH XMLNAMESPACES
construction.