Introduction
Most solutions to convert RTF to plain text with pure T-SQL don't handle special characters like German umlauts and all the other special characters above ASCII(128) because they are not embedded in RTF tags but noted as escaped hex values. Also most of these solutions leave a trailing '}' at the end of the converted text. This revised procedure will solve these problems.
Background
Searching the web for a T-SQL procedure to convert RTF-formatted text to plain text, you'll find a lot of matches. Mainly, there are 2 methods described: the first one uses the RichtextCtrl
control with the need to reconfigure SQL server settings to allow access to OLE/COM which might be a problem in environments with high security guidelines (e.g. http://www.experts-exchange.com/Database/MS-SQL-Server/Q_27633014.html). The second one will be found in some slightly different versions which all produce results with restrictions as described above (e.g. http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=90034).
Using the Code
Add the following SQL function to your database:
USE [<YourDatabaseNameHere>]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE FUNCTION [dbo].[RTF2Text]
(
@rtf nvarchar(max)
)
RETURNS nvarchar(max)
AS
BEGIN
DECLARE @Pos1 int;
DECLARE @Pos2 int;
DECLARE @hex varchar(316);
DECLARE @Stage table
(
[Char] char(1),
[Pos] int
);
INSERT @Stage
(
[Char]
, [Pos]
)
SELECT SUBSTRING(@rtf, [Number], 1)
, [Number]
FROM [master]..[spt_values]
WHERE ([Type] = 'p')
AND (SUBSTRING(@rtf, Number, 1) IN ('{', '}'));
SELECT @Pos1 = MIN([Pos])
, @Pos2 = MAX([Pos])
FROM @Stage;
DELETE
FROM @Stage
WHERE ([Pos] IN (@Pos1, @Pos2));
WHILE (1 = 1)
BEGIN
SELECT TOP 1 @Pos1 = s1.[Pos]
, @Pos2 = s2.[Pos]
FROM @Stage s1
INNER JOIN @Stage s2 ON s2.[Pos] > s1.[Pos]
WHERE (s1.[Char] = '{')
AND (s2.[Char] = '}')
ORDER BY s2.[Pos] - s1.[Pos];
IF @@ROWCOUNT = 0
BREAK
DELETE
FROM @Stage
WHERE ([Pos] IN (@Pos1, @Pos2));
UPDATE @Stage
SET [Pos] = [Pos] - @Pos2 + @Pos1 - 1
WHERE ([Pos] > @Pos2);
SET @rtf = STUFF(@rtf, @Pos1, @Pos2 - @Pos1 + 1, '');
END
SET @rtf = REPLACE(@rtf, '\pard', '');
SET @rtf = REPLACE(@rtf, '\par', '');
SET @rtf = STUFF(@rtf, 1, CHARINDEX(' ', @rtf), '');
WHILE (Right(@rtf, 1) IN (' ', CHAR(13), CHAR(10), '}'))
BEGIN
SELECT @rtf = SUBSTRING(@rtf, 1, (LEN(@rtf + 'x') - 2));
IF LEN(@rtf) = 0 BREAK
END
SET @Pos1 = CHARINDEX('\''', @rtf);
WHILE @Pos1 > 0
BEGIN
IF @Pos1 > 0
BEGIN
SET @hex = '0x' + SUBSTRING(@rtf, @Pos1 + 2, 2);
SET @rtf = REPLACE(@rtf, SUBSTRING(@rtf, @Pos1, 4), _
CHAR(CONVERT(int, CONVERT (binary(1), @hex,1))));
SET @Pos1 = CHARINDEX('\''', @rtf);
END
END
SET @rtf = @rtf + ' ';
SET @Pos1 = PATINDEX('%\%[0123456789][\ ]%', @rtf);
WHILE @Pos1 > 0
BEGIN
SET @Pos2 = CHARINDEX(' ', @rtf, @Pos1 + 1);
IF @Pos2 < @Pos1
SET @Pos2 = CHARINDEX('\', @rtf, @Pos1 + 1);
IF @Pos2 < @Pos1
BEGIN
SET @rtf = SUBSTRING(@rtf, 1, @Pos1 - 1);
SET @Pos1 = 0;
END
ELSE
BEGIN
SET @rtf = STUFF(@rtf, @Pos1, @Pos2 - @Pos1 + 1, '');
SET @Pos1 = PATINDEX('%\%[0123456789][\ ]%', @rtf);
END
END
IF RIGHT(@rtf, 1) = ' '
SET @rtf = SUBSTRING(@rtf, 1, LEN(@rtf) -1);
RETURN @rtf;
END
When copying the above code to SQL don't forget to remove the underscore (wich is only required in codeproject to break long lines)!
To convert any RTF-formatted content, call the function above passing the RTF content as parameter of type nvarchar(max)
:
SELECT [<YourRTFColumnNameHere>]
, [dbo].[RTF2Text]([<YourRTFColumnNameHere>]) AS [TextFromRTF]
FROM [dbo].[<YourDatabaseNameHere>]
The function returns the converted text as nvarchar(max)
too.
More improvements may be added. If you find any RTF part that isn't covered by the function above, please drop a line here.
Thanks
Thanks to all the authors in the web that have posted their solutions until now and therefore deserve the applause. I simply enhanced these solutions to complete the basic conversion.
Thanks also to all users here posting their tips to make the procedure more robust.