Re: Count bitpattern occurences in textstrings - C# Discussion Boards

Re: Count bitpattern occurences in textstrings

invictus314-Jan-08 23:00

14-Jan-08 23:00

I think you misinterpreted my question. I am not after every occurence of the letter "1" but the bitpattern...like one letter is 8 or 16 bits.

Re: Count bitpattern occurences in textstrings

Ajay.k_Singh14-Jan-08 23:11

Ajay.k_Singh

14-Jan-08 23:11

If you want to search for a particular bit pattern within a string, you will first need to convert string to a byte array, this can be done using System.Text.ASCIIEncoding object.

After converting string to byte array we may search this array for a particular bit pattern. Following is a code snippet which will give you some idea about this-

---------------------Code Start-------------------

byte[] bytes;

byte pattern1 = 111;

string str = "hello there how are you";

System.Text.ASCIIEncoding getencoding = new ASCIIEncoding();

//convert string to byte array

bytes = getencoding.GetBytes(str);

int i;

for (i = 0; i < str.Length; i++)

{

if (bytes.GetValue(i).Equals(pattern1))
{

MessageBox.Show("");

}

}
------------------------Code End---------------

I hope this helps Smile | :)

.

-Dave.

Dave Traister,
ComponentOne LLC.
www.componentone.com

Re: Count bitpattern occurences in textstrings

invictus314-Jan-08 23:29

invictus3

14-Jan-08 23:29

But if I have a byte with 8 bits and another consecutive byte with another 8 bits, how can I check this entire range (16 bits) for only a 11 at any location?

for instance:

11000000 00000000 = 1 time
01110000 00000000 = 2 times
etc...

there is no fixed number representing the 11 since it can move around the byte...

Re: Count bitpattern occurences in textstrings

Russell Jones15-Jan-08 0:02

Russell Jones

15-Jan-08 0:02

i guess the crude way would be:

convert your pattern from byte [] to bool []
store the sequence you're searching for as bool[]

loop through the sequence until you match the first value then check the second value of the sequence etc. This isn't going to be quick but it'll work OK for short strings

Russ

Re: Count bitpattern occurences in textstrings

invictus315-Jan-08 0:05

invictus3

15-Jan-08 0:05

Hmmm its kind of unfortunate that is should be so complicated and slow because its for a (non-profit) research project where performance is everything...

Re: Count bitpattern occurences in textstrings

Russell Jones15-Jan-08 0:45

Russell Jones

15-Jan-08 0:45

If you are passing long pieces of binary data about the place as strings you have to be prepared for different people to mean different things by the same number. AFAIK ASCII leaves the choice of numbers above 127 to the person using them. While there are conventions for numbers 127-255 there are more than 1 so it may be that your data gets mangled on the way. .Net languages often try to make sense of this chaos by translating the string from one codepage to another; while this might be good for someone trying to read the string at the end it's the last thing you would want to happen to your data.

If you are going to search the same string repeatedly then maybe you could do something clever with trees, thus putting the effort in at the start when you build the tree and getting payback through faster searching later.

I would use the brute force method to start and make sure it's wrapped up in a tidy method so that you can come back and improve it later or try to find a better way of doing the same thing that doesn't require an exhaustive search of a list of bytes.

Russell

Re: Count bitpattern occurences in textstrings [modified]

Luc Pattyn15-Jan-08 0:56

Luc Pattyn

15-Jan-08 0:56

invictus3 wrote:
performance is everything...

then you need to tell more about the application, and I can come up with several suggestions.

furthermore there are some basic questions:
- is the length of the search pattern fixed? small? always 2? always less than or equal to 8?
- how do characters (or bytes) join together, is it most-significant-bit first, so
msb of byte1 connects to lsb of byte0, or the other way around?
- do you need exact location of a match? number of matches? first match only?
- what if the search pattern is periodic, i.e. how often do you recognize 0101 in 01010101
(2 or 3 times?)

BTW: IMO this topic belongs in the math&algo forum, not the C# forum.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

This month's tips:
- before you ask a question here, search CodeProject, then Google;
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get;
- use PRE tags to preserve formatting when showing multi-line code snippets.

modified on Tuesday, January 15, 2008 7:10:53 AM

Re: Count bitpattern occurences in textstrings

invictus315-Jan-08 23:20

invictus3

15-Jan-08 23:20

Ok here it goes:
- I can not tell more about the application at this point. At this point it is only at the planning stage where we test the possibility. Getting this algorithm right is a prerequisite for any further work.

- For sake of simplicity lets assume that we want to count for all 2-bit (00, 01, 10, 11)(and as a variant perhaps also 3-bit (000, 001, etc...)) combination. It is fixed sized pattern we want to search for, but the string we are searching in is variable sized.

- Whether it is MSB or LSB is irrelevant how I see it. It is all C# strings on a intel laptop, and that is pretty much what I know about that.

- I do not need location: only the count of each pattern

- The pattern you described would be counted 3 times. It will overlap locations already counted.

To be honest I don't want a complete code, but only pointers on how to approach the bit pattern count problem on my own. So far I got 2 possible solutions:
- convert each char into a binary string of 8 bytes where each letter is either 0 or 1 and then to regular string comparison
- use unsafe code and the & operator

Neither seems to be that efficient.

In Java I already solved this problem in a kind of inefficient way: first converting to bytes, then into a binary-string (e.g. each character from the original string takes up 8 chars with either 0/1 in the binary string) and then just do a normal string comparison (e.g. str.equals("01"). However, this is 1) inefficient and 2) in Java when C# is the targeted language. Unfortunately I suck at C# at this point but are reading up on it as we speak.

Re: Count bitpattern occurences in textstrings

Luc Pattyn16-Jan-08 8:56

Luc Pattyn

16-Jan-08 8:56

Hi,

here is my suggested algorithm:

- looking for a pattern of length P;
- have an accumulator that holds a sliding part of the binary data stream; one way to do
that is to initialize it to zero, then inside a loop shift it left over B positions and
ORing B bits of new data (B could be 8 or 16, the accumulator's bit count must
equal or exceed B+P-1);
- have a method that counts the number of pattern matches in the rightmost B+P-1 bits
of the accumulator, i.e. ignore the higher bits in the accumulator;
doing so makes sure every match will be counted only once.

There are basically two ways to do the count method:
1. inside a loop say for(x=0; x<b; x++),="" mask="" away="" all="" but="" p="" bits="" and="" compare="" those<br="" mode="hold"> with the pattern; so have a mask initially equal to (1<<P)-1, AND with accumulator,
compare with pattern; then shift left mask and pattern and compare again, etc.
2. if B+P-1 is small, say less than 16, you can precalculate an array with the
count value for every possible accumulator value, so the count method is just
a single array lookup.

Remarks:
- even if you think of your data as 16-bit items (say Unicode characters), you can still
work with B=8 if you choose to do so, the algorithm does not care what the bit
stream means;
- the sliding accumulator works best if the most-significant-bit of data item 1
connects to the least-significant-bit of data item 0; otherwise it will take
more complex load-and-shift operations.
- the algorithm requires B+P-1 not to exceed the length of the largest available integer
type; otherwise "big integer" techniques become necessary;
- precalculating an array is a common way to drastically improve performance provided
there is sufficient data to process; here the array's dimension is only 1<<(B+P-1)
which is 512 for B=8 and P=2.
- obviously when B+P-1 becomes large, allocating and precalculating an array may become
prohibitive and the loop-implementation may be prefered.
- when the pattern size increases AND the count of all possible pattern values is
required, it may not be obvious what the best order of execution would be:
either first loop over the pattern values, and scan the data multiple times,
or allocate 1<<P arrays and do all pattern values in parallel, scanning the data
only once.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus319-Jan-08 8:41

invictus3

19-Jan-08 8:41

I started reading up on bit-operators in C# and came up with this solution:

static int[] countbits(string s)<br />
{<br />
	int[] vector = { 0, 0, 0, 0 };<br />
<br />
	for (int i = 0; i < s.Length; i++)<br />
	{<br />
		for (int j = 6; j >= 0; j--)<br />
			vector[((s[i] >> j) & 0x03)]++;   //masker med 0x03 for å kun få de 2 siste bitsene<br />
		if (i < s.Length - 1)<br />
			vector[(((s[i] << 1) | (s[i + 1] >> 7)) & 0x03)]++;<br />
	}<br />
<br />
	return vector;<br />
}

Not sure if its the best way, but it works! Smile | :)

Any suggestions for improvement are welcome. Have optimized it some, but I wouldnt be surprised if there are better ways to solve this :p

Mvh Sverre

Re: Count bitpattern occurences in textstrings

Luc Pattyn19-Jan-08 9:13

Luc Pattyn

19-Jan-08 9:13

Hi,

from your approach I see you want to get the count for each possible 2-bit pattern at once,
which was not mentioned before.

I think the larger accumulator approach is more elegant and performant than the explicit
join statetement(s) your approach needs.

And I am confident an array-based solution (with precalculated counts) has the highest
performance, assuming the amount of data warrants the initial investment.

Both were introduced in my previous reply.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus319-Jan-08 11:16

invictus3

19-Jan-08 11:16

Actually I was considering the suggestion about array of pre-calculated values, but I couldn't think of a good way of making the bits between the bytes (last bit from one byte and the first one from the next) count.

Not sure if I understand your larger accumulator approach though...

Re: Count bitpattern occurences in textstrings

Luc Pattyn19-Jan-08 11:36

Luc Pattyn

19-Jan-08 11:36

The simplest way to look at it is to have an accumulator that holds two consecutive characters,
which you achieve by shift-and-add or shift-and-or. The trick is to AND away the extraneous
bits that would cause some positions to be counted twice, hence the P+B-1 stuff.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus319-Jan-08 12:57

invictus3

19-Jan-08 12:57

Thats almost what I did if I understand correctly. Although I did only use one character and just shifted the needed bit from the next character into its correct position and then OR'ed it with the one character.

But I think I understand the difference...like:

int16 holding b[0] and b[1]

then shift it 1 position at a time until its shiftet 8 times (therefore all bits from b[0] have been in the MSB position while the first bit from b[1] being in the consecutive position to the right from the MSB...and then jump to b[1] and repeat....

Re: Count bitpattern occurences in textstrings

Luc Pattyn19-Jan-08 14:01

Luc Pattyn

19-Jan-08 14:01

You still have not fully understood my original reply[^] where I use one (or more) precalculated arrays to do the counting.

So counting all the "00" patterns in one data item takes:
- one accumulator adjustment (shift+or)
- one mask
- one array lookup.

And counting all the "00", "01", "10" and "11" patterns in one data item takes:
- one accumulator adjustment (shift+or)
- one mask
- four array lookups.

i.e. I never treat bits individually, I always work B bits at a time.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus321-Jan-08 0:09

invictus3

21-Jan-08 0:09

I wrote a hybrid solution. The first count and the generateTable are just used to make the table so ignore those. The last method is the actual counting. As you can see I look up the entire byte, BUT to count the "in between" bits I use my old method.

Any feedback to this solution? At least I am getting closer :p

<br />
static int[] count(int x, int q)	//use to make the 4-item vector from a number<br />
{<br />
  int size = (int)Math.Pow(2, q);<br />
  int mask = size - 1;<br />
  int[] vector = new int[size];<br />
<br />
  for (int j = 8 - q; j >= 0; j--)<br />
  {<br />
    vector[((x >> j) & mask)]++;<br />
  }<br />
<br />
  return vector;<br />
}<br />
<br />
static int[][] generateTable()		//generates 256 vectors with the count<br />
{<br />
  int[][] table = new int[256][];<br />
<br />
  for (int i = 0; i < 256; i++)<br />
    table[i] = G_Binary2(i, 2);<br />
<br />
  return table;<br />
}<br />
<br />
static int[] count(string s, int[][] table)	//this is the entry point<br />
{<br />
  const int mask = 0x03;<br />
  int[] vector = new int[4];<br />
<br />
  for (int i = 0; i < s.Length; i++)<br />
  {<br />
    int x = s[i] & 0xFF;<br />
    for (int j = 0; j < 4; j++)<br />
      vector[j] += table[x][j];<br />
    if(i < s.Length-1)<br />
      vector[(((s[i] << 1) | (s[i + 1] >> 7)) & mask)]++;<br />
  }<br />
<br />
  return vector;<br />
}<br />

Re: Count bitpattern occurences in textstrings

Luc Pattyn21-Jan-08 0:45

Luc Pattyn

21-Jan-08 0:45

// P bits in pattern
// table holds 2^P one-dimensional tables with 2^(B+P-1) entries each (B=8 here)
static int[] GetCounts(string s, int P, int[][] table) {
    int powerP=1<<P;                // 4 when P==2
    int[] counts=new int[powerP];
    int mask=(0x80<<P)-1;           // 0x1FF when P==2
    int acc=0;
    foreach (char c in s) {
        acc=(acc<<8) | (((int)c) & 0xFF);
        for (int pat=0; pat<powerP; pat++) counts[pat]+=table[pat][acc & mask];
    }
    return counts;
}

BTW: the above is off by one (for P==2), you may or may not care to fix it (not very easy without
performance penalty)

PS: don't use pow() in an integer application, it is both slower and potentially inaccurate.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus321-Jan-08 0:57

invictus3

21-Jan-08 0:57

Thanks! I was not aware of the problem with pow(). is there any similar integer issue/fix with Math.Abs() and Math.sqrt() as well?

Re: Count bitpattern occurences in textstrings

Luc Pattyn21-Jan-08 1:09

Luc Pattyn

21-Jan-08 1:09

invictus3 wrote:
s there any similar integer issue/fix with...

each time you convert to and from reals for no good reason you may loose performance, and
may get wrong results due to limited accuracy and/or rounding errors.

x=iabs(x) ==> if (x<0) x=-x;

x=isqrt(x) ==> no simple int solution, but one seldom needs isqrt

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

invictus321-Jan-08 22:41

invictus3

21-Jan-08 22:41

I am grateful for the help.

I got one final question though. Its more relating to the last topic, Math.pow(). I would run Math.pow(94, x), but since you told me that pow was inefficient for integer operations I wonder what is the better way to solve this?

Re: Count bitpattern occurences in textstrings

Luc Pattyn22-Jan-08 1:07

Luc Pattyn

22-Jan-08 1:07

for loop
while loop
loop + array
94?

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Count bitpattern occurences in textstrings

Ennis Ray Lynch, Jr.15-Jan-08 11:00

Ennis Ray Lynch, Jr.

15-Jan-08 11:00

Use unsafe code, get the pointer to the string data and then cast it into byte or int or whatever size you want. Then use the binary operator & to do the comparison.

Need a C# Consultant? I'm available.

Happiness in intelligent people is the rarest thing I know. -- Ernest Hemingway

Re: Count bitpattern occurences in textstrings

invictus315-Jan-08 23:10

invictus3

15-Jan-08 23:10

This was an interesting idea.

Lets assume I have a string that consists only of the ascii symbols from code 32 and up to 126. Now each character must be at least 7 bits long (but probably are 8 as a result of byte of 8 bits). This leads to alot of possible combinations to be checked with the & operator if I want to check for all possible 2-bit patterns (even the one overlapping into the next character).

If I make the assumption that characters are 8 bits and include the first 0 in my calculations (to make it easier) I would still have to test against all possible combinations of 00, 01, 10, and 11 at all two consecutive slots within each character. Furthermore, I would have to test each of the four against the last bit of the current char and the first bit of the next char.

I am fully aware that this post perhaps should be located in the math/algo forum now. The question I now want to ask is this: is there any clever boolean algebra method for testing for this bit pattern in each possible location or do I have to manually test each of the possible locations?

If any mod reads this then perhaps this and the previous posts can be moved to the correct forum. Thank you.

Re: Count bitpattern occurences in textstrings

Ennis Ray Lynch, Jr.16-Jan-08 4:11

Ennis Ray Lynch, Jr.

16-Jan-08 4:11

I don't have the exact method but the strlen method in C in the assembly uses some really good math in a similar method.

Need a C# Consultant? I'm available.

Happiness in intelligent people is the rarest thing I know. -- Ernest Hemingway

Re: Count bitpattern occurences in textstrings

PIEBALDconsult15-Jan-08 15:41

PIEBALDconsult

15-Jan-08 15:41

return 42 ;

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.