Introduction
Though not in our day to day programming, but sometime we may need to parse the HTML text for extracting the data. Recently, I encountered the same and henceforth I thought of sharing this with you all.
Background
Recently one of my colleagues ran into a problem. His requirement is that he is getting some HTML text from where he needs to extract the relevant text. After Googling, I found some answers SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression but that won't meet the requirement. Henceforth, I have created one for him. At the same time, he has one more challenge. The HTML text can be anything i.e. it can have JavaScript functions, style tags, etc. Moreover, it can be embedded within the BODY
tags or there may not be any BODY
tags. In either case, only portions from within the body tags need to be parsed.
e.g.
Input(A):
<html><head><title></title><body><a href="#">Extract text from here</a></body></html>
Input(B)
<a href="#">Extract text from here</a>
Output: Extract text from here
Program / Sql Query
DECLARE @OriginalStr VARCHAR(MAX)
DECLARE @NewStr VARCHAR(MAX)
DECLARE @StartIndex INT
DECLARE @EndIndex INT
DECLARE @Noise TABLE(Noise VARCHAR(100),ReplaceChars VARCHAR(10))
An example of parsing html at <a href="%22http://www.codeproject.com%22">codeproject.com</a>' -- original data
SET @OriginalStr = '
Click the button <input type="button" value="some button" onclick="test();">
hello world
I am followed by a BLANK SPACE and then a terminator
It''s a new line
Fonts are nice features
Nice link......
<a href="%22http://www.codeproject.com%22">www.codeproject.com</a>
'
INSERT INTO @Noise(Noise,ReplaceChars)
SELECT ' ',space(1) UNION ALL SELECT '>',space(1) UNION ALL
SELECT '<',space(1) UNION ALL SELECT '
',space(2) UNION ALL
SELECT ' ',space(2)
-- End of initialisation
--Step 1: Extract text between and
SELECT @StartIndex= CHARINDEX('', @OriginalStr)
SELECT @EndIndex= CHARINDEX('',@OriginalStr)
SELECT @OriginalStr =
CASE WHEN(@StartIndex = 0 or @EndIndex = 0) THEN @OriginalStr
ELSE
SUBSTRING(@OriginalStr,@StartIndex ,_
((@EndIndex - @StartIndex) + 7 )) END-- Adding 7 since length of = 7
--Step 2: Extract the relevant text between the tags
--Generate a number table
;WITH Num_Cte AS
(
SELECT 1 AS rn
UNION ALL
SELECT rn +1 AS rn
FROM Num_Cte
WHERE rn <= LEN(@OriginalStr)
)
-- Shred into individual characters
, Get_Individual_Chars_Cte AS
(
SELECT
rn AS Id
,chars
FROM Num_Cte
CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1) AS chars) SplittedChars
)
--Combine the characters again to obtain the extracted text between the tags
SELECT @NewStr = ExtractedText FROM (
SELECT CAST(chars AS VARCHAR(MAX)) FROM (
--Get the characters between the tags
SELECT a.* FROM Get_Individual_Chars_Cte a
INNER JOIN
(
--Find the position of the characters between the tags
SELECT rn FROM Num_Cte EXCEPT
(SELECT n.rn
FROM Num_Cte n
JOIN(
-- Find the start and end range of characters
-- between < and > for all tags
SELECT
Id AS StartRange
,(SELECT TOP 1 Id
FROM Get_Individual_Chars_Cte
WHERE Id > a.Id
AND chars='>'
ORDER BY Id) AS EndRange
FROM Get_Individual_Chars_Cte a
WHERE chars='<')X(StartRange,EndRange)
ON n.rn BETWEEN X.StartRange AND X.EndRange)
)X(rn)
ON a.Id = X.rn
)X(id,chars)
FOR XML PATH('')
)X(ExtractedText)
OPTION(MAXRECURSION 0)
--Remove the noises
SELECT @NewStr = REPLACE(@NewStr, Noise, ReplaceChars) FROM @Noise
SELECT ParsedText = @NewStr
Output
ParsedText
Click the button hello world I am followed by a BLANK SPACE and then a terminator It's a new line Fonts are nice features Nice link...... www.codeproject.com
Explanation
First we need to check if the HTML given has any or not and that will be the input source.
The very next step after that is to pick up the relevant text.
For accomplishing this, we will take the help of a number table that has been created on the fly using a recursive cte.
Next split the data into individual characters. e.g. CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1) AS chars) SplittedChars.
The next step is to find the start and end range of characters between < and > for all tags.
Once this task is accomplished, the next step is to find the position of the characters between the tags and combine the characters again (by using FOR XML PATH('')) to obtain the extracted text between the tags.
After combining the characters, we will encounter some junk characters like ' ','>' etc.
For that reason, we will keep a noise table where the noises (let's call the junk characters so) will be replaced by one space. Once done, we will achieve our goal.
Points of Interest
This article showed a way of parsing data using the SET BASED approach. Also it shows us how we can replace data in one single Replace statement instead of multiple. The usage of FOR XML PATH in combining the characters is really handy as substantiated from this article.
History
- 12th January, 2010: Initial post