
How to Substring Articles/News with HTML Tags on Server-Side

7 Jul 2015
This tip shows how to truncate HTML-formatted text on the server side without leaving tags open.

Introduction

In this tip, we will learn how to summarize HTML-formatted text on the server side. We will write a method that truncates the text to a given number of visible characters and closes any tags that the cut leaves open.

Background

Summarizing plain text is easy. However, if your blog or news articles are written with a rich-text editor, the stored text contains HTML tags, and a naive substring can leave tags open or cut through the middle of a tag. Many people solve this with JavaScript: they send the whole article to the index page and summarize it on the client side. We don't need to do that. Moreover, imagine a blog whose index page lists 10-20 articles with summaries. If the server sends every article in full, most of the transferred characters are wasted.

Using the Code

The SubstringHtml method is easy to use. You pass the text and the number of visible characters (tags excluded) that should remain after the cut.

public string SubstringHtml(string stringValue, int length)

First, we need to define regular expressions.

var regexAllTags = new Regex(@"<[^>]*>");
var regexIsTag = new Regex(@"<|>");
var regexOpen = new Regex(@"<[^/][^>]*>");
var regexClose = new Regex(@"</[^>]*>");
var regexAttribute = new Regex(@"<[^ ]*");

"regexAllTags" is used to measure the text length. The string is stored in the database together with its HTML tags, so its raw length includes the tags. Therefore, we remove all tags before measuring the text.

"regexOpen" and "regexClose" are used to detect opening tags and closing tags.

"regexAttribute" is used to strip the attributes from an opening tag so that the tag can be turned into its matching closing tag.

"regexIsTag" is used to check whether the cut position falls inside a tag.

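To see what these patterns actually match, here is a minimal sketch that runs four of them against a small string. The class name TagPatternDemo and its helper methods are hypothetical; only the patterns themselves come from the code above.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class; only the patterns are taken from the tip.
public static class TagPatternDemo
{
    static readonly Regex AllTags = new Regex(@"<[^>]*>");
    static readonly Regex Open = new Regex(@"<[^/][^>]*>");
    static readonly Regex Close = new Regex(@"</[^>]*>");
    static readonly Regex Attribute = new Regex(@"<[^ ]*");

    // Removes every tag, leaving only the visible text.
    public static string StripTags(string html) => AllTags.Replace(html, "");

    // All opening tags, attributes included.
    public static string[] OpenTags(string html) =>
        Open.Matches(html).Cast<Match>().Select(m => m.Value).ToArray();

    // All closing tags.
    public static string[] CloseTags(string html) =>
        Close.Matches(html).Cast<Match>().Select(m => m.Value).ToArray();

    // Everything from '<' up to the first space (or the whole tag if it has none).
    public static string TagHead(string openTag) => Attribute.Match(openTag).Value;
}
```

For the input `<p>Hi <b class='x'>you</b></p>`, StripTags leaves the visible text, OpenTags finds the two opening tags, and TagHead reduces the second one to `<b`. Note that "regexOpen" also matches tags that never need closing, such as `<br>`, so this approach assumes the text contains only paired tags.
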
int necessaryCount = 0;

if (regexAllTags.Replace(stringValue, "").Length <= length)
{
    return stringValue;
}

If the "stringValue" without tags is no longer than "length", there is nothing to cut and we return the string unchanged.

string[] split = regexAllTags.Split(stringValue);
string counter = "";

foreach (string item in split)
{
   if (counter.Length < length && counter.Length + item.Length >= length)
   {
       necessaryCount = stringValue.IndexOf(item,counter.Length) 
       + item.Substring(0, length - counter.Length).Length;

       break;
   }

   counter += item;
}

In this part, we strip the tags and split the text into sections to find the section in which the cut falls. We then search for that section in the original string (which still contains the tags), starting at index "counter.Length". The result, "necessaryCount", is the index at which to cut the original string so that "length" visible characters remain.
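
One caveat with searching by IndexOf: if the text of the section also occurs earlier in the string, the search can stop at the wrong place. For example, with `<i>abab</i><b>ab</b>` and a length of 5, IndexOf("ab", 4) finds the second "ab" inside the `<i>` element rather than the one inside `<b>`. The following is a sketch of an alternative, not the method used in this tip: it walks the tag matches and tracks the position in the original string directly. The names CutIndexSketch and FindCutIndex are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; an alternative to the Split + IndexOf loop.
public static class CutIndexSketch
{
    static readonly Regex AllTags = new Regex(@"<[^>]*>");

    // Returns the index in html at which to cut so that
    // `length` visible (non-tag) characters are kept.
    public static int FindCutIndex(string html, int length)
    {
        int visible = 0; // visible characters counted so far
        int pos = 0;     // current position in the original string

        foreach (Match tag in AllTags.Matches(html))
        {
            int textLen = tag.Index - pos; // text run before this tag
            if (visible + textLen >= length)
                return pos + (length - visible);

            visible += textLen;
            pos = tag.Index + tag.Length; // skip over the tag itself
        }

        // The cut falls in the text after the last tag (or there is less text than asked for).
        return Math.Min(html.Length, pos + (length - visible));
    }
}
```

Because the position is derived from the tag matches themselves, repeated text cannot confuse it, and the returned index never lands inside a tag.
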

var x = regexIsTag.Match(stringValue, necessaryCount);
if (x.Value == ">")
{
    necessaryCount = x.Index + 1;
}

In this part, we check whether the cut falls inside a tag. With the split-sections technique this should not happen, but you may need this check if you change or extend the code.
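
The check can be isolated into a small helper to make its behavior easier to see. This is a sketch only; TagBoundarySketch and SnapPastTag are hypothetical names, and it assumes the visible text contains no literal '<' or '>' characters.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper wrapping the "regexIsTag" check.
public static class TagBoundarySketch
{
    static readonly Regex IsTag = new Regex(@"<|>");

    // If `index` falls inside a tag, returns the index just past that tag's '>';
    // otherwise returns `index` unchanged.
    public static int SnapPastTag(string html, int index)
    {
        // The next angle bracket at or after `index` tells us where we are:
        // '>' means we are inside a tag, '<' means we are in plain text.
        Match next = IsTag.Match(html, index);
        return next.Success && next.Value == ">" ? next.Index + 1 : index;
    }
}
```

In `<p>Hello <a href='x'>link</a></p>`, an index pointing into `href` is moved to just after the `>` of the `<a>` tag, while an index pointing into "Hello" is left alone.
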

We now have "necessaryCount", a safe cut index in the original string. Next, we cut the text and close all tags that are left open.

string subs = stringValue.Substring(0, necessaryCount);

var openTags = regexOpen.Matches(subs);

var closeTags = regexClose.Matches(subs);
List<string> OpenTags = new List<string>();
foreach (var item in openTags)
{
    string trans = regexAttribute.Match(item.ToString()).Value;

    trans = "</" + trans.Substring(1, trans.Length - 1);

    if (trans.Last() != '>')
    {
        trans += ">";
    }

    OpenTags.Add(trans);
}

In this section, we strip the attributes (and any whitespace) from each opening tag. We then convert each one into its closing form so that it can be compared with the list of closing tags.
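
As a standalone illustration, the conversion of a single opening tag might look like the sketch below. ClosingTagSketch and ToClosingTag are hypothetical names; the pattern is the "regexAttribute" defined earlier.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper showing the open-tag to close-tag conversion on its own.
public static class ClosingTagSketch
{
    static readonly Regex Attribute = new Regex(@"<[^ ]*");

    public static string ToClosingTag(string openTag)
    {
        // Keep everything up to the first space: "<a" for "<a href='blah'>",
        // or the whole tag for "<p>".
        string head = Attribute.Match(openTag).Value;

        // Replace the leading '<' with "</".
        string closing = "</" + head.Substring(1);

        // Tags that had attributes lost their '>' along with them.
        return closing.EndsWith(">") ? closing : closing + ">";
    }
}
```

Here `<a href='blah'>` becomes `</a>` and `<p>` becomes `</p>`. The sketch assumes the tag name is followed by either a space or `>`.
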

foreach (System.Text.RegularExpressions.Match close in closeTags)
{
    OpenTags.Remove(close.Value);
}

We compare the two lists and remove from the OpenTags list every tag that has already been closed. What remains is the list of tags that are still open.

for (int i = OpenTags.Count - 1; i >= 0; i--)
{
    if(i == 0) subs += "...";
    subs += OpenTags[i];
}

return subs;

Finally, we loop in reverse (last in, first out) and append the closing tags to the end of the string. The "..." ellipsis is inserted just before the outermost closing tag.
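
List.Remove drops the first matching entry, so when tags of the same name are nested (for example `<b><i><b></b>`), the remaining closing tags can come out in the wrong order. If that matters for your content, a stack is one way to keep the order right. The sketch below is an alternative technique, not the code from this tip; UnclosedTagsSketch, ClosingSuffix and the combined tag pattern are hypothetical, and it assumes well-formed markup without void elements such as `<br>`.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical stack-based alternative to the two-list comparison.
public static class UnclosedTagsSketch
{
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    static readonly Regex AnyTag = new Regex(@"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the closing tags needed to balance `html`, innermost first.
    public static string ClosingSuffix(string html)
    {
        var open = new Stack<string>();
        foreach (Match m in AnyTag.Matches(html))
        {
            string name = m.Groups[2].Value;
            if (m.Groups[1].Value == "/")
            {
                // Pop only when the closing tag matches the innermost open tag.
                if (open.Count > 0 && open.Peek() == name) open.Pop();
            }
            else
            {
                open.Push(name);
            }
        }

        // Whatever is left on the stack is still open; close it in LIFO order.
        var suffix = new StringBuilder();
        while (open.Count > 0)
            suffix.Append("</").Append(open.Pop()).Append('>');
        return suffix.ToString();
    }
}
```

The stack mirrors the nesting of the document, so popping it yields the closing tags from the innermost element outwards.
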

Example

string ex = "<p>hellooo codeproject<a href='blah'> blah</a><strong> blahblahh</strong> dsafsdf</p>"; 
string substring = SubstringHtml(ex, 30);

And the result is:

substring = "<p>hellooo codeproject<a href='blah'> blah</a><strong> blahb</strong>...</p>"

If you count the characters of the visible text, "hellooo codeproject blah blahb", you will see that it is exactly 30 characters long.
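
For convenience, here are the fragments from this tip assembled into one compilable class. The wrapper class HtmlSummarizer, the static modifier, the local variable names and the comments are our additions; the logic is meant to follow the fragments step by step, so treat it as a sketch rather than a hardened implementation.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the body follows the fragments from the tip.
public static class HtmlSummarizer
{
    public static string SubstringHtml(string stringValue, int length)
    {
        var regexAllTags = new Regex(@"<[^>]*>");
        var regexIsTag = new Regex(@"<|>");
        var regexOpen = new Regex(@"<[^/][^>]*>");
        var regexClose = new Regex(@"</[^>]*>");
        var regexAttribute = new Regex(@"<[^ ]*");

        // Nothing to cut if the visible text already fits.
        if (regexAllTags.Replace(stringValue, "").Length <= length)
            return stringValue;

        // Find the index in the original string that keeps `length` visible characters.
        int necessaryCount = 0;
        string counter = "";
        foreach (string item in regexAllTags.Split(stringValue))
        {
            if (counter.Length < length && counter.Length + item.Length >= length)
            {
                necessaryCount = stringValue.IndexOf(item, counter.Length)
                                 + (length - counter.Length);
                break;
            }
            counter += item;
        }

        // Safety net: if the cut index is inside a tag, move it past the tag.
        Match next = regexIsTag.Match(stringValue, necessaryCount);
        if (next.Value == ">")
            necessaryCount = next.Index + 1;

        string subs = stringValue.Substring(0, necessaryCount);

        // Turn every opening tag into its closing form...
        var pending = new List<string>();
        foreach (Match open in regexOpen.Matches(subs))
        {
            string trans = "</" + regexAttribute.Match(open.Value).Value.Substring(1);
            if (!trans.EndsWith(">")) trans += ">";
            pending.Add(trans);
        }

        // ...and drop the ones that were already closed.
        foreach (Match close in regexClose.Matches(subs))
            pending.Remove(close.Value);

        // Close the rest, innermost first; the ellipsis goes before the outermost tag.
        for (int i = pending.Count - 1; i >= 0; i--)
        {
            if (i == 0) subs += "...";
            subs += pending[i];
        }
        return subs;
    }
}
```

With this class in place, the example above becomes `HtmlSummarizer.SubstringHtml(ex, 30)`.
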

Conclusion

I hope this tip will help you. Please share your valuable thoughts and comments. Your feedback is always welcome.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
