RegEx: How to match a string that is not part of an HTML tag or special character?

Question

Hi,

I have some HTML text. When I display it, I want to highlight certain keywords, but I don't want a keyword to match when it is part of an HTML tag or of a special character (entity) such as &nbsp;.

For example:
My HTML text: <span>Hello&#160;&#160;Welcome to my Spa No. 160</span>

My keywords: spa 160

For highlighting I wrap each match as <span class="highlight">keyword</span>.

But now it also matches the "spa" inside the <span> tag and the "160" inside the entity &#160;.

How can I overcome this? I use C# Regex.

I need a regex that matches the keywords, but not inside tags or special characters.

Thank you in advance.

Posted 25-Jul-13 21:22pm

good.shankar

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Brian A Stephens · Answer 1 · 2013-11-01T02:27:00

What you want is negative lookbehinds:

(?<!</?[^>]*|&[^;]*)(\b160\b|\bspa\b)

and replace with

<span class="highlight">$1</span>

The negative lookbehind syntax is: (?<! ... ), which indicates that the keyword cannot be preceded by a certain pattern. That pattern in this case is either the beginning of a tag </?[^>]* or the beginning of an HTML entity &[^;]* that isn't complete.

</?[^>]* indicates an open bracket, possibly followed by a slash, followed by any number of chars that aren't close brackets.

&[^;]* indicates an ampersand followed by any number of chars that aren't semicolons.

Here's how to incorporate this into your C# code:

C#

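string[] keywords = { "spa", "160", "whatever" };
// Verbatim strings (@"...") keep \b as a regex word boundary; in a regular C# string, "\b" is a backspace character.
string pattern = @"(?<!</?[^>]*|&[^;]*)(\b" + string.Join(@"\b|\b", keywords) + @"\b)";
string highlighted = Regex.Replace(htmlContent, pattern, "<span class=\"highlight\">$1</span>", RegexOptions.IgnoreCase);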

EDIT: I incorporated the good point made by Andreas Gieriet - that you need to ensure you are matching complete "words" only by matching word boundaries with \b.
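
As a minimal, self-contained sketch of the lookbehind approach described in this answer (the class name `LookbehindHighlighter` and the method `Highlight` are hypothetical, not from the original post), the pattern can be built with verbatim strings so that `\b` reaches the regex engine as a word boundary. Escaping the keywords with `Regex.Escape` is an added assumption, in case a keyword contains regex metacharacters:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindHighlighter
{
    // Wraps each whole-word keyword in a highlight span, unless the keyword
    // is preceded by an unclosed "<" (inside a tag) or an unterminated "&"
    // (inside an entity).
    public static string Highlight(string html, params string[] keywords)
    {
        // Assumption: keywords are plain text, so escape regex metacharacters.
        string words = string.Join("|", keywords.Select(k => Regex.Escape(k)));

        // Verbatim strings keep \b as a regex word boundary.
        string pattern = @"(?<!</?[^>]*|&[^;]*)\b(" + words + @")\b";

        return Regex.Replace(html, pattern,
            "<span class=\"highlight\">$1</span>", RegexOptions.IgnoreCase);
    }
}
```

Calling `LookbehindHighlighter.Highlight("<span>Hello&#160;&#160;Welcome to my Spa No. 160</span>", "spa", "160")` should return `<span>Hello&#160;&#160;Welcome to my <span class="highlight">Spa</span> No. <span class="highlight">160</span></span>`. One caveat of this sketch: a stray `&` or `<` in the visible text that is never followed by `;` or `>` suppresses every keyword after it, so it assumes reasonably well-formed HTML.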

Kyle A.B. · Answer 2 · 2013-11-01T00:57:00

I'm not really sure what you are trying to accomplish here. If you want to highlight "Spa No. 160", try this RegEx:

"[Ss]pa\s[Nn]o\.\s160"

If you want to highlight just the words "Spa" and "160" then try this one:

"(?<!(\<|\<\/))[Ss]pa|(?<!\&\#)160"

The above RegEx uses a negative look behind to ensure that it doesn't include a < or </ before Spa or spa and it doesn't include a &# before 160.

Negative Look Behind

-Kyle
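
For comparison, here is a minimal sketch of how the second pattern in this answer might be applied with `Regex.Replace`. The class and method names are hypothetical, and the pattern is written without the redundant backslash escapes, which should be equivalent. Note that this pattern has no `\b` word boundaries, so it also highlights "spa" inside longer words:

```csharp
using System;
using System.Text.RegularExpressions;

public static class FixedLookbehindHighlighter
{
    // "spa" must not directly follow "<" or "</"; "160" must not directly follow "&#".
    private const string Pattern = @"(?<!</?)[Ss]pa|(?<!&#)160";

    public static string Highlight(string html)
    {
        // $0 stands for the whole match in a .NET replacement string.
        return Regex.Replace(html, Pattern, "<span class=\"highlight\">$0</span>");
    }
}
```

With this sketch, `FixedLookbehindHighlighter.Highlight("A space")` should return `A <span class="highlight">spa</span>ce`, which is why matching whole words with `\b` is worth adding.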

Andreas Gieriet · Answer 3 · 2013-11-01T03:17:00

Use \b around the keywords to anchor words that must stand on their own.
To additionally ignore HTML entities, you can take advantage of the fact that the .NET Regex engine tries the alternatives from left to right at each position: prefix the match for the words with an alternative that matches the entities first, e.g.

C#

string pattern = @"&#?\w+;|(\bspa\b|\b160\b)";
foreach (Match match in Regex.Matches(input, pattern))
{
   // Group 1 is set only when a keyword matched, not when an entity was consumed.
   if (match.Groups[1].Success)
   {
      string text = match.Groups[1].Value;
      ...
   }
}

Or with Linq:

C#

...
var emphasis = Regex.Matches(input, pattern).Cast<Match>().Where(m=>m.Groups[1].Success).Select(m=>m.Groups[1].Value);
foreach(string text in emphasis)
{
   ...// do emphasize
}

Cheers
Andi
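
The same idea can be turned into an actual replacement. The following is only a sketch under some assumptions: the class and method names are hypothetical, and the extra `<[^>]*>` alternative for skipping whole tags is an addition that is not part of the answer above. Tags and entities are matched first and written back unchanged; only matches that fill group 1 are wrapped:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SkipAndCaptureHighlighter
{
    // Alternatives are tried left to right at each position, so tags and
    // entities are consumed before the keyword alternative can see them.
    private const string Pattern = @"<[^>]*>|&#?\w+;|(\bspa\b|\b160\b)";

    public static string Highlight(string html)
    {
        return Regex.Replace(html, Pattern, m =>
            m.Groups[1].Success
                ? "<span class=\"highlight\">" + m.Value + "</span>" // keyword
                : m.Value,                                           // tag or entity, unchanged
            RegexOptions.IgnoreCase);
    }
}
```

For the sample input from the question, `SkipAndCaptureHighlighter.Highlight` should return `<span>Hello&#160;&#160;Welcome to my <span class="highlight">Spa</span> No. <span class="highlight">160</span></span>`. Unlike a lookbehind, this variant does not need to rescan the text to the left of every candidate match.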
