Very Fast Splitter with Support for Multi-Character String Separators

Remon Zakaria

17 Jun 2004

A very fast Split function that can treat a multi-character delimiter either as one single separator or, like the regular Split, as a set of individual separator characters.

Introduction

This article presents a fast split function that combines the classic Split, which splits an expression on any one of a set of characters, with a new mode that treats a whole string of characters as a single separator.

Why a New Split

Have you ever tried to write a file-sender program with resume support? You need, for example, a simple protocol that carries the file ID, the packet number, and the packet data. What separator can you use to split these fields? If you use a single character, you run the risk that the packet data contains that same character by chance, and your program then misparses the packet or crashes. Instead, you can choose a sequence of characters that is very unlikely to occur in the data; let's assume it is '(::)'. The problem is that the ordinary Split function treats each of those characters as a separator of its own.
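
The packet layout described above could be sketched roughly as follows. This is only an illustration: the Packet class, its method names, and the field order are assumptions, and it relies on the String.Split(string[], int, StringSplitOptions) overload, which was added in .NET Framework 2.0, after this article was written. Limiting the split to three tokens keeps the payload intact even when the payload itself happens to contain the separator.

```csharp
using System;

// Hypothetical helper for illustration; not part of the article's code.
public static class Packet
{
    // Multi-character separator suggested in the article.
    public const string Separator = "(::)";

    // Joins file ID, packet number, and payload into one string.
    public static string Build(string fileId, int packetNumber, string data)
    {
        return fileId + Separator + packetNumber + Separator + data;
    }

    // Splits into at most three tokens, so everything after the second
    // separator is returned unchanged as the payload.
    public static string[] Parse(string packet)
    {
        return packet.Split(new string[] { Separator }, 3,
                            StringSplitOptions.None);
    }
}
```

The sketch still assumes that the file ID never contains the separator; only the last field is protected by the token limit.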

Example

Expression: Hi(::)How are you? :)I hope you are fine(::)

Output of ordinary Split:

Hi
Empty String
Empty String
Empty String
How are you?
Empty String
I hope you are fine
Empty String
Empty String
Empty String
Empty String

Output of my Split:

Hi
How are you? :)I hope you are fine
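
For comparison, the same expression can be run through the framework's own Split overloads. The sketch below is not the article's function: it assumes a runtime that offers String.Split(string[], StringSplitOptions) (.NET Framework 2.0 or later), and the class and method names are made up for the example.

```csharp
using System;

// Illustrative only; names are assumptions made for this example.
public static class SplitComparison
{
    const string Sample = "Hi(::)How are you? :)I hope you are fine(::)";

    // Every character of "(::)" acts as a separator of its own.
    public static string[] ByCharacters()
    {
        return Sample.Split("(::)".ToCharArray());
    }

    // The whole string "(::)" acts as one separator.
    public static string[] ByString()
    {
        return Sample.Split(new string[] { "(::)" }, StringSplitOptions.None);
    }
}
```

ByCharacters returns the eleven tokens listed above. ByString returns three tokens, because the built-in overload keeps the empty token after the trailing separator, whereas the function in this article drops it and returns two.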

Usage

Function Header

public static string[] Split(string Expression,
       string Delimiter, bool SingleSeparator,
       int Count, ComparisonMethod Compare)

Expression: The string to split.
Delimiter: The string to split with.
SingleSeparator: true to treat the whole Delimiter string as a single separator; false to perform the ordinary Split, where each character of Delimiter is a separator.
Count: Maximum number of tokens to return. A negative value means no limit.
Compare: A ComparisonMethod value that indicates whether delimiter matching is case sensitive (Binary) or case insensitive (Text).

Split Module Code

using System;

namespace Infinity
{
    public enum ComparisonMethod
    {
        Binary = 0,
        Text = 1
    }

    namespace StringSplitter
    {
        /// <summary>
        /// Splits strings with support for multi-character,
        /// multi-line delimiters.
        /// </summary>
        public class CSplitter
        {
            /// <summary>
            /// Holds the string to split.
            /// </summary>
            private static string m_Expression;

            /// <summary>
            /// Delimiter to split the expression with.
            /// </summary>
            private static string m_Delimiter;

            /// <summary>
            /// Constructor for the splitter.
            /// </summary>
            public CSplitter()
            {
                //
                // TODO: Add constructor logic here
                //
            }

            private static bool
              isValidDelimiterBinary(int StringIndex,
              int DelimiterIndex)
            {
                if (DelimiterIndex == m_Delimiter.Length) return true;
                if (StringIndex == m_Expression.Length) return false;
                // If the current character of the expression matches
                // the current character of the Delimiter,
                // then go to the next character
                if (m_Expression[StringIndex] ==
                             m_Delimiter[DelimiterIndex])
                    return isValidDelimiterBinary(StringIndex + 1,
                                               DelimiterIndex + 1);
                else
                    return false;
            }

            private static bool
               isValidDelimiterText(int StringIndex,
               int DelimiterIndex)
            {
                if (DelimiterIndex == m_Delimiter.Length) return true;
                if (StringIndex == m_Expression.Length) return false;
                // If the current character of the expression
                // matches the current character of the Delimiter,
                // then go to the next character
                if (Char.ToLower(m_Expression[StringIndex])
                        == Char.ToLower(m_Delimiter[DelimiterIndex]))
                    return isValidDelimiterText(StringIndex + 1,
                                        DelimiterIndex + 1);
                else
                    return false;
            }

            public static string[] Split(string Expression, 
               string  Delimiter, bool SingleSeparator, 
               int  Count, ComparisonMethod Compare) 
            {
                // Update private members
                m_Expression = Expression;
                m_Delimiter = Delimiter;

                // List to hold the split tokens
                System.Collections.ArrayList Tokens =
                      new System.Collections.ArrayList();

                // If not using a single separator,
                // then use the regular Split function
                if (!SingleSeparator)
                    if (Count >= 0)
                      return Expression.Split(Delimiter.ToCharArray(), Count);
                    else
                      return Expression.Split(Delimiter.ToCharArray());

                // If Count is 0, return an empty array
                if (Count == 0)
                    return new string[0];
                else
                    // If Count is 1, return the whole expression
                    if (Count == 1)
                        return new string[] {Expression};
                    else
                        Count--;

                // An empty delimiter would match at every index without
                // advancing the loop, so return the expression as one token
                if (Delimiter.Length == 0)
                    return new string[] {Expression};

                // Indexer to loop over the string with
                int i;
                // The start index of the current
                // token in the expression
                int iStart = 0;

                if (Compare == ComparisonMethod.Binary)
                {
                    for (i = 0; i < Expression.Length; i++)
                    {
                        if (isValidDelimiterBinary(i, 0))
                        {
                            // Assign new token
                            Tokens.Add(Expression.Substring(iStart,
                                                          i - iStart));
                            // Update index
                            i += Delimiter.Length - 1;
                            // Update current token start index
                            iStart = i + 1;
                            // If we reached the tokens limit, then exit the loop
                            if (Tokens.Count == Count && Count >= 0) break;
                        }
                    }
                }
                else
                {
                    for (i = 0; i < Expression.Length; i++)
                    {
                        if (isValidDelimiterText(i, 0))
                        {
                            // Assign new token
                            Tokens.Add(Expression.Substring(iStart,
                                                            i - iStart));
                            // Update index
                            i += Delimiter.Length - 1;
                            // Update current token start index
                            iStart = i + 1;
                            // If we reached the tokens limit, then exit the loop
                            if (Tokens.Count == Count && Count >= 0) break;
                        }
                    }
                }
                string LastToken = "";
                // If there is still data that has not been added
                if (iStart < Expression.Length)
                {
                    LastToken = Expression.Substring(iStart,
                                        Expression.Length - iStart);
                    if (LastToken == Delimiter)
                        Tokens.Add(null);
                    else
                        Tokens.Add(LastToken);
                }
                else
                    // If there are no elements in the tokens list,
                    // then pass the whole string as the one element
                    if (Tokens.Count == 0) Tokens.Add(Expression);

                // Return the split tokens
                return (string[])Tokens.ToArray(typeof(string));
            }
        }
    }
}

Code in Details

Comparison Method Enumeration

public enum ComparisonMethod
{
       Binary = 0,
       Text = 1
}

Used to specify whether matching is case sensitive (Binary) or case insensitive (Text).

CSplitter members:

// Holds the string to split
private static string m_Expression;
// Delimiter to split the expression with
private static string m_Delimiter;

I created these variables because the delimiter-matching functions need them, and it did not seem sensible to pass both strings as parameters on every call. So I store them once at class level and pass only the indices into them, as you can see below. Note that because the fields are static, concurrent calls to Split from several threads would overwrite each other's values.

isValidDelimiterBinary Function

private static bool isValidDelimiterBinary(int StringIndex, int DelimiterIndex)
{
    if (DelimiterIndex == m_Delimiter.Length) return true;
    if (StringIndex == m_Expression.Length) return false;
    // If the current character of the expression matches
    // the current character of the Delimiter, then go to the next character
    if (m_Expression[StringIndex] == m_Delimiter[DelimiterIndex])
        return isValidDelimiterBinary(StringIndex + 1, DelimiterIndex + 1);
    else
        return false;
}

This is a recursive function that takes a start index in the Expression and a start index in the Delimiter. It contains the whole trick, I think. Let's go through it line by line:

if (DelimiterIndex == m_Delimiter.Length)  return true;
if (StringIndex == m_Expression.Length) return false;

Those are the two stop conditions:

The first terminates the function and returns true once all the delimiter characters have been checked and matched. The second returns false if delimiter checking is not finished yet but we have reached the end of the expression.

if (m_Expression[StringIndex] == m_Delimiter[DelimiterIndex]) 
    return isValidDelimiterBinary(StringIndex + 1, DelimiterIndex + 1);
else
    return false;

If the current character of the expression matches the current character of the Delimiter, the function calls itself with both indices incremented by 1; otherwise it returns false. When you call it from the main function, all you have to do is pass the index at which matching should start, and 0 as the DelimiterIndex so that matching starts from the first character of the delimiter.

bool res = isValidDelimiterBinary(i, 0);
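
The same check can also be written as a loop that takes both strings as parameters. The sketch below is an illustrative alternative, not the article's code; the class and method names are assumptions. It applies the same two stop conditions as the recursive version: running out of expression returns false, and exhausting the delimiter returns true.

```csharp
// Illustrative alternative; the class and method names are assumptions.
public static class DelimiterMatch
{
    // Returns true when delimiter occurs in expression starting at index.
    public static bool IsAt(string expression, string delimiter, int index)
    {
        for (int d = 0; d < delimiter.Length; d++)
        {
            // Ran out of expression before the delimiter was fully matched.
            if (index + d >= expression.Length) return false;
            // A single mismatching character rules this position out.
            if (expression[index + d] != delimiter[d]) return false;
        }
        // Every delimiter character matched.
        return true;
    }
}
```

For the string a(::)b, DelimiterMatch.IsAt("a(::)b", "(::)", 1) is true and the same call with index 0 is false. Like the recursive version, it reports a match for an empty delimiter at any position.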

isValidDelimiterText Function

It is exactly the same function, but it matches in a case-insensitive way. I preferred to write two functions rather than check whether the user wants case-sensitive matching every time I loop over the expression characters. The only difference is this part of the matching:

Char.ToLower(m_Expression[StringIndex]) == 
                        Char.ToLower(m_Delimiter[DelimiterIndex])

Here, I convert the two characters to lowercase before comparing them. Someone might ask: why not convert the whole string once into a temporary string and work with that? That is a good idea, but it means looping once to convert and a second time to match, which is not efficient. Also, imagine a user passing a long string (30,000 characters, for instance) and asking for only two elements back. You would convert the entire string even though the first separator might occur within the first 100 characters. I think that would be a performance disaster. :)
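
The same lazy, position-by-position idea can be expressed with the framework's String.Compare overload that takes start indices, a length, and a StringComparison value (available from .NET Framework 2.0). The sketch below is an assumption-laden alternative rather than the article's method: the names are invented, and it uses ordinal case-insensitive rules, whereas Char.ToLower(char) follows the current culture, so the two can disagree for some characters.

```csharp
using System;

// Illustrative alternative; the class and method names are assumptions.
public static class CaseInsensitiveMatch
{
    // True when delimiter occurs in expression at index, ignoring case.
    // At most delimiter.Length characters are examined per call, and no
    // lowercase copy of the expression is created.
    public static bool IsAt(string expression, string delimiter, int index)
    {
        return string.Compare(expression, index, delimiter, 0,
                              delimiter.Length,
                              StringComparison.OrdinalIgnoreCase) == 0;
    }
}
```

With this helper, CaseInsensitiveMatch.IsAt("1SEP2", "sep", 1) is true, while the same call with index 0 is false.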

Split Function

Now we come to the main function that does it all. First, we update the m_Expression and m_Delimiter member variables with the values passed in.

m_Expression = Expression;
m_Delimiter = Delimiter;

// List to hold the tokens
System.Collections.ArrayList Tokens = new System.Collections.ArrayList();

This is an ArrayList that holds the tokens. We use it because we need a fast, dynamically growing collection that can be converted to a string array.
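
On runtimes with generics (.NET Framework 2.0 and later, so newer than this article), a List<string> fills the same role without the cast at the end. The following is a minimal sketch under that assumption; the TokenList class and its method are hypothetical names used only for illustration.

```csharp
using System.Collections.Generic;

// Illustrative only; names are assumptions made for this example.
public static class TokenList
{
    // Gathers tokens in a strongly typed list and converts it to string[].
    public static string[] ToStringArray(IEnumerable<string> source)
    {
        List<string> tokens = new List<string>();
        foreach (string token in source)
        {
            tokens.Add(token);   // grows as needed, like ArrayList.Add
        }
        return tokens.ToArray(); // already a string[], no cast required
    }
}
```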

SingleSeparator Parameter Handling

// If not using a single separator, then use the regular Split function
if (!SingleSeparator)
    if (Count >= 0)
        return Expression.Split(Delimiter.ToCharArray(), Count);
    else
        return Expression.Split(Delimiter.ToCharArray());

This part checks whether the user wants the regular Split method. If so, it also checks whether a Count was supplied (Count >= 0) and calls the matching overload.
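
To make the effect of Count in this branch concrete, here is a small sketch that isolates the same decision. The CountDemo class and the SplitLimited method are hypothetical names; the two String.Split overloads it calls are the ones used in the snippet above.

```csharp
// Illustrative only; names are assumptions made for this example.
public static class CountDemo
{
    // Mirrors the regular-Split branch: a non-negative count limits the
    // number of tokens, a negative count means "no limit".
    public static string[] SplitLimited(string expression,
                                        string separatorChars, int count)
    {
        char[] separators = separatorChars.ToCharArray();
        if (count >= 0)
            return expression.Split(separators, count);
        return expression.Split(separators);
    }
}
```

For the input "a,b;c,d" with the separator characters ",;", a count of 2 yields "a" and "b;c,d", while a count of -1 yields four tokens. A count of 0 yields an empty array.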

Count Parameter Handling

// If Count is 0, return an empty array
if (Count == 0)
    return new string[0];
else
    // If Count is 1, return the whole expression
    if (Count == 1)
        return new string[] {Expression};
    else
        Count--;

This part handles the special cases of the Count parameter as follows:

Count = 0: Return an empty array.
Count = 1: Return an array that contains only the original string.
Otherwise, decrement Count by one; this is explained later.

The Main Loop

int i;          // Indexer to loop over the string with
int iStart = 0; // The start index of the current token in the expression

if (Compare == ComparisonMethod.Binary)
{
    for (i = 0; i < Expression.Length; i++)
    {
        if (isValidDelimiterBinary(i, 0))
        {
            // Assign new token
            Tokens.Add(Expression.Substring(iStart, i - iStart));
            // Update index
            i += Delimiter.Length - 1;
            // Update current token start index
            iStart = i + 1;
            // If we reached the tokens limit, then exit the loop
            if (Tokens.Count == Count && Count >= 0) break;
        }
    }
}
else
{
    for (i = 0; i < Expression.Length; i++)
    {
        if (isValidDelimiterText(i, 0))
        {
            // Assign new token
            Tokens.Add(Expression.Substring(iStart, i - iStart));
            // Update index
            i += Delimiter.Length - 1;
            // Update current token start index
            iStart = i + 1;
            // If we reached the tokens limit, then exit the loop
            if (Tokens.Count == Count && Count >= 0) break;
        }
    }
}

Both branches of the if statement are the same; the only difference is that one calls isValidDelimiterBinary and the other calls isValidDelimiterText. I will explain the first branch (the binary matching):

for (i = 0; i < Expression.Length; i++)
{
    if (isValidDelimiterBinary(i, 0))
    {
        // Assign new token
        Tokens.Add(Expression.Substring(iStart, i - iStart));
        // Update index
        i += Delimiter.Length - 1;
        // Update current token start index
        iStart = i + 1;
        // If we reached the tokens limit, then exit the loop
        if (Tokens.Count == Count && Count >= 0) break;
    }
}

This part does the looping. I used a for loop rather than an enumerator because I need an index to work with. Yes, I could use an enumerator with a manually incremented index, but why do the extra processing? :) Before we start, consider the string from the demo project, a(::)b(::)c()(::) (::)(::), which we will split on the (::) delimiter. First, we check whether the current Expression character is the start of a complete Delimiter sequence.

if (isValidDelimiterBinary(i, 0))

If so, we do the following:

// Assign new token
Tokens.Add(Expression.Substring(iStart, i - iStart));

Add the characters from the start index up to the character before the current one. For example, for the first delimiter found, i = 1 and iStart = 0, so the string added is "a".

// Update index
i += Delimiter.Length - 1;

Move the index i to the last character of the delimiter; the loop's own i++ then steps past it.

// Update current token start index
iStart = i + 1;

Update the next token's start index, iStart, so that it points to the first character after the delimiter (which is "b" in our case).

// If we reached the tokens limit, then exit the loop
if (Tokens.Count == Count && Count >= 0) break;

This part checks whether the user asked for a limited number of tokens. If so, we stop one token short of Count (we decremented it above), because the rest of the string has to go into the last element of the returned array.
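
To see where the loop finds delimiters in the demo string, the following sketch records the start index of each one. It is not the article's code: it swaps the character-by-character check for String.IndexOf with an ordinal comparison (an overload available from .NET Framework 2.0), and the class and method names are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only; names are assumptions made for this example.
public static class SeparatorScan
{
    // Returns the start index of every non-overlapping occurrence of
    // delimiter in expression, scanning from left to right.
    public static List<int> FindAll(string expression, string delimiter)
    {
        List<int> positions = new List<int>();
        // An empty delimiter would never advance the scan, so report nothing.
        if (delimiter.Length == 0) return positions;

        int from = 0;
        while (from < expression.Length)
        {
            int found = expression.IndexOf(delimiter, from,
                                           StringComparison.Ordinal);
            if (found < 0) break;
            positions.Add(found);
            // Jump over the delimiter, as the article's loop does with
            // "i += Delimiter.Length - 1" followed by the loop's i++.
            from = found + delimiter.Length;
        }
        return positions;
    }
}
```

For a(::)b(::)c()(::) (::)(::) and the delimiter (::), the recorded positions are 1, 6, 13, 18 and 22. Adding the delimiter length to each gives the iStart values 5, 10, 17, 22 and 26 that the walkthrough above describes.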

Remaining Characters Check

Now we have finished the loop. Let's see whether any characters remain. If they do, we check whether they are exactly one more delimiter; if so, we add a null string, otherwise we add the remaining characters. If no characters remain, we check whether any token was produced; if not, we add the whole string as a single token.

string LastToken = "";
// If there is still data that has not been added
if (iStart < Expression.Length)
{
    LastToken = Expression.Substring(iStart, Expression.Length - iStart);
    if (LastToken == Delimiter)
        Tokens.Add(null);
    else
        Tokens.Add(LastToken);
}
else
    // If there are no elements in the tokens list,
    // then pass the whole string as the one element
    if (Tokens.Count == 0) Tokens.Add(Expression);

Return Array Of strings

Finally, return the tokens to the caller as an array of strings.

// Return the tokens
return (string[])Tokens.ToArray(typeof(string));
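
The steps walked through above (Count handling, the main loop, and the remaining-characters check) can be condensed into one self-contained sketch. It is an approximation of the single-separator path, not a replacement for the module: it covers binary comparison only, passes the strings as parameters instead of using static fields, uses List<string> and String.CompareOrdinal, and adds a guard for an empty delimiter. All names in it are assumptions.

```csharp
using System.Collections.Generic;

// Condensed sketch of the single-separator path, binary comparison only.
public static class SplitSketch
{
    public static string[] Split(string expression, string delimiter, int count)
    {
        if (count == 0) return new string[0];
        if (count == 1) return new string[] { expression };
        // Added guard (not in the walkthrough): an empty delimiter
        // would match everywhere and never advance the loop.
        if (delimiter.Length == 0) return new string[] { expression };
        // Stop one token early so the remainder becomes the last token.
        int limit = count - 1;

        List<string> tokens = new List<string>();
        int start = 0;
        for (int i = 0; i < expression.Length; i++)
        {
            if (string.CompareOrdinal(expression, i, delimiter, 0,
                                      delimiter.Length) != 0)
                continue;
            tokens.Add(expression.Substring(start, i - start));
            i += delimiter.Length - 1;   // land on the delimiter's last char
            start = i + 1;               // next token begins right after it
            if (tokens.Count == limit) break;
        }

        if (start < expression.Length)
        {
            string last = expression.Substring(start);
            // As in the article, a remainder equal to the delimiter is null.
            tokens.Add(last == delimiter ? null : last);
        }
        else if (tokens.Count == 0)
        {
            tokens.Add(expression);
        }
        return tokens.ToArray();
    }
}
```

With a negative count, the demo string a(::)b(::)c()(::) (::)(::) produces five tokens: "a", "b", "c()", a single space, and an empty string. With a count of 3, it produces "a", "b", and the untouched remainder "c()(::) (::)(::)".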

Disclaimer

This code is free for personal use. However, if you are going to use it for commercial purposes, you need to purchase a license.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
