Re: Regular Expressions - C# Discussion Boards

Bassam Abdul-Baki10-Sep-10 4:09

10-Sep-10 4:09

Luc Pattyn wrote:
1.
we have a regular expressions forum

I knew I should have checked all the forums. However, I am using C#. Smile | :)

Maybe someone will move this to the right forum.

Luc Pattyn wrote:
2.
you're worried about readability and yet you use regular expressions?

It's a simple numeric search. I could easily add a comment that I'm actually searching for 6-2-4, but only 2-4 is needed since the numbers begin with 6.

Luc Pattyn wrote:
3.
what do you want to happen with numbers having multiple two's and four's?
is 612423 two matches? a match and a fail?

As long as there's a 2 before a 4, it passes. The string 612345 was just an example. It could be 64444444444444422222222222123678901236789012345678903456789034567890 and it would still pass.

PIEBALDconsult10-Sep-10 4:24

PIEBALDconsult

10-Sep-10 4:24

Bassam Abdul-Baki wrote:
as there's a 2 before a 4, it passes

Check for the 6

Compare String.IndexOf ( 2 ) < String.LastIndexOf ( 4 ) (Check for -1 too)

Done

Bassam Abdul-Baki10-Sep-10 4:26

10-Sep-10 4:26

Nice! I like this, but my example only had two numbers for simplicity. I actually have three numbers. Missed it by this much. Smile | :)

PIEBALDconsult10-Sep-10 4:37

PIEBALDconsult

10-Sep-10 4:37

Well, why didn't you say so?

http://msdn.microsoft.com/en-us/library/5xkyx09y.aspx[^]

if ( s [ 0 ] == firstdigit )

and int x = s.IndexOf ( secondsdigit ) > 0

and int y = s.IndexOf ( thirddigit , x ) > 0

and int z = s.IndexOf ( fourthdigit , y ) > 0

etc.

Perhaps in a loop.

Bassam Abdul-Baki10-Sep-10 4:46

10-Sep-10 4:46

D'oh! That would work. Haven't used IndexOf in quite a while. Blush | :O

Thanks!

Alan Balkany10-Sep-10 4:02

10-Sep-10 4:02

If you really want to optimize this, just step through the characters with a state machine.

This is faster than using a regular expression, which has overhead because it's designed to handle the much more complex general case.

Bassam Abdul-Baki10-Sep-10 4:11

10-Sep-10 4:11

Alan Balkany wrote:
If you really want to optimize this, just step through the characters with a state machine.

Not familiar with it. However, I have an obscene amount of strings that I'm parsing and it's coded in C#.

Alan Balkany10-Sep-10 4:12

10-Sep-10 4:12

Are you always looking for these three digits, or was that just an example?

Bassam Abdul-Baki10-Sep-10 4:15

10-Sep-10 4:15

The three digits were an example, but the same digits all the time. So yes and no.

Alan Balkany10-Sep-10 4:22

10-Sep-10 4:22

A state machine uses a single int to mark where you are in your search. E.g.:

int state = 0;

foreach (char c in nextString)
    switch (state)
    {
        case (0): if (c == '6') state = 1;
            break;
        case (1): if (c == '2') state = 2;
           break;
        case (2): if (c == '4') state = 3;
           break;
    }

if (state == 3)
    return (FOUND);
else
    return (NOT_FOUND);

This is way faster than using regular expressions, which must handle much more complex cases.

Bassam Abdul-Baki10-Sep-10 4:27

Re: Regular Expressions [modified]

10-Sep-10 4:27

Groovy (not the language)! I'll try it out. Thanks!

Luc Pattyn10-Sep-10 4:34

10-Sep-10 4:34

I'm afraid the switch will turn out to be expensive, and it really is unnecessary. How about:

int len=s.Length;
int i=0;
while(i<len) {
    char c=s[i++];
    if (c=='6') {
        while(i<len) {
            c=s[i++];
            if (c=='2') {
                while(i<len) {    // FIXED
                    c=s[i++];
                    if (c=='4') return true;
                }
            }
        }
    }
}
return false;

or the more structured:

string search="624x";  // FIXED
int iSearch=0;
char cSearch=search[iSearch];
foreach(char c in s) {
    if(c==cSearch) {
        cSearch=search[++iSearch];    // FIXED
        if (cSearch=='x') return true;
    }
}
return false;

BTW: none if these fulfill the requirement, as they would accept 6424 whereas the OP wants to reject that; so the state machine needs to be a tiny bit more complex.

Smile | :)

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

modified on Friday, September 10, 2010 11:03 PM

Alan Balkany10-Sep-10 4:39

10-Sep-10 4:39

Yes, your first example is a more efficient implementation of the state machine, but I was trying to illustrate the concept with the most clarity.

My example also could be optimized by immediately returning when the third digit was found, but I wanted to show a pure example of a state machine.

Luc Pattyn10-Sep-10 4:56

10-Sep-10 4:56

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

Bassam Abdul-Baki10-Sep-10 4:40

10-Sep-10 4:40

Luc Pattyn wrote:
BTW: none if these fulfill the requirement, as they would accept 6424 whereas the OP wants to reject that; so the state machine needs to be a tiny bit more complex.

No, I wouldn't. When I said as long as 6-2-4 appear in that order, I meant 6[0-9]*2[0-9]*4. Smile | :)

Bassam Abdul-Baki10-Sep-10 4:59

10-Sep-10 4:59

The first example you provided is quite clear, although you had a len missing in the third while loop. I'm not so sure about the second one(since I don't have a compiler at work - not my thing anymore). Never mind. I got it. Except the 'x' confused the hell out of me. Smile | :)

I think I like the second one better, but I need to use the one that's faster.

Thanks!

Luc Pattyn10-Sep-10 5:19

10-Sep-10 5:19

sorry. I fixed both mistakes now.

I don't expect you would notice a significant performance difference between both approaches.

Smile | :)

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

Bassam Abdul-Baki10-Sep-10 5:35

10-Sep-10 5:35

No, but I am wondering between your first method and PIEBALDconsult's use of IndexOf above. Your second method actually has a more dynamic use for me which I may or may not need. Either way, it helped me learn what I've forgotten.

Thanks!

Luc Pattyn10-Sep-10 5:49

10-Sep-10 5:49

IndexOf is fine for short strings; it is also fine for long strings if you know the character is present. It is a waste if you need several of them and most of those return nothing.

BTW: there also is a LastIndexOf which actually scans from right to left. So looking for 2...4 could be achieved like this:

int i2=s.IndexOf('2');
if (i2>=0 && i2<s.LastIndexOf('4')) return true;

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

Bassam Abdul-Baki10-Sep-10 5:55

10-Sep-10 5:55

Thanks, I saw that. Actually, the string I'm searching for will only appear roughly one third of the time. Why would it be a problem if the character is not present? It should basically scan the entire string just like your method does. However, I'm inclined to believe that Microsoft must have optimized the IndexOf and LastIndexOf to work just as well as your first search method. It's your second method that I'm still deciding on.

Luc Pattyn10-Sep-10 6:00

10-Sep-10 6:00

an IndexOf-based approach, when using several calls to such methods, has the disadvantage of scanning the string several times; a state-machine based approach typically scans just once.

I'm not sure what you mean by my first and second approach; the one with cSearch is a bit neater than the nested while loops, as it takes less code and scales better when more than a few items need to be present in a specific order. As I said, the speed difference would be minimal, the only overhead it adds is advancing one position in "search" each time the state-machine's state changes (as opposed to entering another nested while loop). When in doubt, just give it a try.

Smile | :)

Luc Pattyn [Forum Guidelines] [Why QA sucks] [My Articles] Nil Volentibus Arduum

Please use <PRE> tags for code snippets, they preserve indentation, and improve readability.

Bassam Abdul-Baki10-Sep-10 6:07

10-Sep-10 6:07

Luc Pattyn wrote:
an IndexOf-based approach, when using several calls to such methods, has the disadvantage of scanning the string several times;

PIEBALDconsult's example shows the IndexOf(Char, Int32) or LastIndexOf(Char, Int32) searching from the last known position. That's identical to what you're doing, but I would think Microsoft has it optimized.

Luc Pattyn wrote:
I'm not sure what you mean by my first and second approach;

Your cSearch (while) was the first one. That one is somewhat self-explanatory, but the second one allows me to create more dynamic searches in case the values that I'm searching for change. It also means I am not fixed at N amount of while loops.

Luc Pattyn wrote:
When in doubt, just give it a try.

Yeah, I intend to. I just don't have Visual Studio here and no admin rights to install the Express version. We need a Portable Visual Studio Microsoft. Smile | :)

harold aptroot10-Sep-10 6:27

harold aptroot

10-Sep-10 6:27

Bassam Abdul-Baki wrote:
but I would think Microsoft has it optimized

I went to look, but found out that it's implemented as a call to "somewhere" (0xFFFFFFFFFF47A100, any clues?)
So I still know nothing. It could be optimized.. but knowing MS it is likely to be a highly generic "good for all CPU's" routine and therefore rather slow. That's what they usually do.

AspDotNetDev10-Sep-10 6:19

AspDotNetDev

10-Sep-10 6:19

Luc Pattyn wrote:
IndexOf is fine for short strings; it is also fine for long strings if you know the character is present. It is a waste if you need several of them and most of those return nothing.

Huh? You do know there are overloads for IndexOf that take the starting position to scan from, right? Scan for first character, then scan for the second character starting from the position that follows the first, then scan for the third character starting from the position that follows the second character, and so on. I figure that would be very clear code that would be very performant as well.

// You'd need to add logic to test if any of the characters don't exist (i.e., if IndexOf returns -1).
int index = str.IndexOf(char3, str.IndexOf(char2, str.IndexOf(char1) + 1) + 1);

[Forum Guidelines]

Bassam Abdul-Baki10-Sep-10 6:34