Re: Substring Matching (Harder than it sounds) - Algorithms Discussion Boards

Re: Substring Matching (Harder than it sounds)

Skippums29-Aug-07 6:30

29-Aug-07 6:30

Doesn't that algorithm assume that s2 contains the start of the string s1? This is not actually the case. The longest substring could occur anywhere in either string. So for that algorithm to work, I would have to try s1, then s1 without the first character, then s1 without the second character, then s1 without the third character, etc. I think it would actually be faster (not really sure) to do it the way I proposed with the boolean table.

Re: Substring Matching (Harder than it sounds)

Luc Pattyn29-Aug-07 6:32

Luc Pattyn

29-Aug-07 6:32

Hi,

I can offer a few ideas that may reduce the size of the problem:

1. remove in string1 all chars that don't occur in string2 and vice versa;
this process typically shortens the strings, and reduces the problem to a
larger number of much smaller problems.
In your example it replaces ("abacadaba", "dbaadabb")
by ("aba", "dbaadabb") and ("adaba", "dbaadabb") based on 'c'
and the first of these can be reduced further based on 'd'.

2. search for small patterns that occur in one and not in the other string.
Again in your example that would be "aa", allowing a further reduction (here
you can split in between the two a's).

3. Once you have sufficiently reduced you could consider your quadratic cost
approach.

4. but as an alternative you could do a kind of binary search on the match length:
- start with an arbitrary length, in the shorter string, select some substring
with that length, and try to find it in the other string. If you find one,
the match can only be that size or larger. If you don't find one, you could
try all other substrings or fall back to a smaller length.
It all depends on the kind of data you want to handle well. Your example is not
a real indication I guess.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

this weeks tips:
- make Visual display line numbers: Tools/Options/TextEditor/...
- show exceptions with ToString() to see all information
- before you ask a question here, search CodeProject, then Google

Re: Substring Matching (Harder than it sounds) [modified]

Skippums29-Aug-07 7:42

Skippums

29-Aug-07 7:42

OK. I have thought about it some more, and have come up with the following solution:

Let lstr be the longer string and sstr be the shorter string...

First, I will build a dictionary with characters as keys and a singly linked list of integer indexes as the value, which will represent lstr. This table will take O(lstr.len) memory and O(lstr.len) time to build, but will be keyed in O(1) time.

Then I will make the following recursive function, which is sent the lookup table, the two strings, and an integer representing the longest string found thus far:

1. If the longest string found is longer than sstr sent in, return empty string.
2. Initialize result to be the empty string.
3. Choose character c from sstr to be the center character.
4. For each index in the dictionary for c, find the matching substring (match).
5. If match.len from (3) > result.len declared in (1), result = match. If more indexes occur for c in the dictionary, go back to (3)
6. If result.len > sstr.len / 2, return result.
Else
resultLow = recursively call this using sstr.substr(0, sstr.len / 2) and
resultHigh= recursively call this using sstr.substr(sstr.len / 2)
return the longest of (result, resultLow, resultHigh)

This will limit my searches so I don't have to waste my time comparing things that are already found. I think this algorithm will turn out to be O(lstr.len * log(lstr.len)). Any thoughts?

-- modified at 14:28 Wednesday 29th August, 2007

Re: Substring Matching (Harder than it sounds)

Russell^'29-Aug-07 8:06

Russell^'

29-Aug-07 8:06

I like very much the idea of build the dictionary! But I mean a dictionary of all the words inside s1 and s2.(be carefull with tabulations: " " "," "." ...)
In that way you can convert the long array of char on a short array of integer:
the integer are simply the substitution of the words with the index of that word inside the dictionary (.... the dictionary could not exist really: we don't need to came back to the original text).
I think that this way is used also from some compressing algorithm.
Now you can apply the algoruthm that you want, but surely the comparison must run faster (it depends only on how much time you spend on build the dictionary)

Russell

Re: Substring Matching (Harder than it sounds)

Skippums29-Aug-07 8:28

Skippums

29-Aug-07 8:28

It is not a dictionary of words, but a dictionary of characters. That way, I know there are only about 40 characters that I have to worry about (a-z (case insensitive), 0-9, and a few other special characters). I am in this instance doing character matching, not word matching, and I am only building the dictionary for the longer of the two strings s1 or s2.

Re: Substring Matching (Harder than it sounds)

Russell^'29-Aug-07 9:02

Russell^'

29-Aug-07 9:02

Skippums wrote:
It is not a dictionary of words, but a dictionary of characters

My dictionary is different from yours: if you have little string that are repeated you can think that that are 'words' and proceed with my method (only to reduce the length of the array) Big Grin | :-D

Only you can decide, I don't know what kynd of strings you are working on and what kynd of semplification can be done.

Russell

Re: Substring Matching (Harder than it sounds)

PIEBALDconsult29-Aug-07 18:03

PIEBALDconsult

29-Aug-07 18:03

Sounds like part of doing a diff of two file.

Re: Substring Matching (Harder than it sounds)

Mark Churchill29-Aug-07 18:26

Mark Churchill

29-Aug-07 18:26

I get the feeling its going to end up On^2ish no matter how you code around it. How big are your strings?

Brute force without the ram overhead would be to scan index pointers thru A and B, looking ahead in A as you get matches in B and keeping the highest match. Then incrementing the A pointer to the next character etc.

Depending on what your text looks at you might be able to hack in some optimisations (like checking if A[ia]+longestmatchlength != B[ib]+longestmatchlength -> increment ib by longestmatchlength).

In fact that small optimisation could speed an individual scan of the B array up over time - possibly even flattening the On^2 back down to closer to On.

Mark Churchill
Director
Dunn & Churchill

Re: Substring Matching (Harder than it sounds)

cp987629-Aug-07 23:55

cp9876

29-Aug-07 23:55

There has been a lot of work done on this for diff'ing files and many algorithms published. Just google on diff algorithms, there are a couple of articles on CP, e.g. here[^] and here[^]

Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."

Re: Substring Matching (Harder than it sounds)

Skippums30-Aug-07 4:04

Skippums

30-Aug-07 4:04

The second link that you gave proposes the exact same idea as to how to do this that I came up with yesterday. I will only perform step #1, but that page does confirm to me that my idea on how to tackel this is a valid one. Using that algorithm, I think that I will eliminate most of the checks, which will yield an O(n*log(n)) algorithm (I think). Thank you very much for posting this so I could confirm that I am not going down a dead end! Thanks to everyone else who posted responses for me as well!

Re: Substring Matching (Harder than it sounds)

Skippums4-Sep-07 11:23

Skippums

4-Sep-07 11:23

Here is the code (in c#) that I finally ended up implementing. As I said before, I think it is O(n*log(n))...

public static string LongestMatchingSubstring(string s1, string s2, int minMatchLength, bool caseSensitive) {
if (string.IsNullOrEmpty(s1) || string.IsNullOrEmpty(s2))
return string.Empty;
string shorter, longer;
if (s1.Length > s2.Length) {
shorter = s2;
longer = s1;
} else {
shorter = s1;
longer = s2;
}
if (shorter.Length < minMatchLength)
return string.Empty;
Dictionary<char, IList<int>> hashTable = new Dictionary<char, IList<int>>(
Math.Min(caseSensitive ? 96 : 64, longer.Length));
if (caseSensitive) {
for (int i = 0; i < longer.Length; ++i) {
char c = longer[i];
if (!hashTable.ContainsKey(c))
hashTable.Add(c, new List<int>());
hashTable[c].Add(i);
}
} else {
for (int i = 0; i < longer.Length; ++i) {
char c = char.ToLower(longer[i]);
if (!hashTable.ContainsKey(c))
hashTable.Add(c, new List<int>());
hashTable[c].Add(i);
}
}
string result = string.Empty;
--minMatchLength;
FindLongestMatchingSubstring(shorter, longer,
hashTable, ref result, ref minMatchLength, caseSensitive);
return result;
}

private static void FindLongestMatchingSubstring(string str1, string str2,
Dictionary<char, IList<int>> ht, ref string result, ref int minMatchLength, bool caseSensitive) {
if (str1.Length <= minMatchLength)
return;
int midpoint = str1.Length / 2;
char key = caseSensitive ? str1[midpoint] : char.ToLower(str1[midpoint]);
if (ht.ContainsKey(key)) {
foreach (int index in ht[key]) {
string temp = GetFullMatch(str1, midpoint, str2, index, caseSensitive);
if (temp.Length > minMatchLength) {
minMatchLength = temp.Length;
result = temp;
}
}
}
FindLongestMatchingSubstring(str1.Substring(0, midpoint),
str2, ht, ref result, ref minMatchLength, caseSensitive);
FindLongestMatchingSubstring(str1.Substring(midpoint + 1),
str2, ht, ref result, ref minMatchLength, caseSensitive);
}

private static string GetFullMatch(string str1, int str1Idx, string str2, int str2Idx, bool caseSensitive) {
int minOffset = -Math.Min(str1Idx, str2Idx);
int maxOffset = Math.Min(str1.Length - str1Idx, str2.Length - str2Idx) - 1;
int lowOffset = -1, highOffset = 1;
if (caseSensitive) {
for (; lowOffset >= minOffset && str1[str1Idx + lowOffset] ==
str2[str2Idx + lowOffset]; --lowOffset)
;
for (; highOffset <= maxOffset && str1[str1Idx + highOffset] ==
str2[str2Idx + highOffset]; ++highOffset)
;
} else {
for (; lowOffset >= minOffset && char.ToLower(str1[str1Idx + lowOffset]) ==
char.ToLower(str2[str2Idx + lowOffset]); --lowOffset)
;
for (; highOffset <= maxOffset && char.ToLower(str1[str1Idx + highOffset]) ==
char.ToLower(str2[str2Idx + highOffset]); ++highOffset)
;
}
++lowOffset;
--highOffset;
return str2.Substring(str2Idx + lowOffset, highOffset - lowOffset + 1);
}

Hope this helps some people!

Jeff

Re: Substring Matching (Harder than it sounds)

Luc Pattyn4-Sep-07 13:06

Luc Pattyn

4-Sep-07 13:06

Skippums wrote:
I think it is O(n*log(n))...

I did not look at the code, but why don't you measure it?
It suffices to perform three runs with complexity c, 10*c and 100*c
where c should take a measurable time (say 1 second)
From the measured times, you can figure out the curve that fits...

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Substring Matching (Harder than it sounds)

Skippums4-Sep-07 13:26

Skippums

4-Sep-07 13:26

It is slightly more difficult, though, because the time it takes depends on the data. For example, if the longest substring matches immediately and is longer than half the length of the shorter substring, then it takes only O(n) time. If none of the charaters from either string match, it also only takes O(n) time. If the smaller string is shorter than the minimum result you are willing to accept, it takes O(1) time to inform me that I am an idiot for trying. I would like to test it for the worst case senario, but I am having trouble figuring out what that is. Anyone have any insight to a case when this will perform worse than O(n)?

Re: Substring Matching (Harder than it sounds)

Luc Pattyn5-Sep-07 9:00

Luc Pattyn

5-Sep-07 9:00

Seems like you need a lot of test cases (random data could be considered) and a good
definition of what your primary measure of size "n" is.

Lacking these, your "I think it is O(n*log(n))..." seems a bit premature.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Substring Matching (Harder than it sounds)

Skippums5-Sep-07 10:46

Skippums

5-Sep-07 10:46

Worst case that I can think of is if you have a string like 0101010...10 of length m = 2^i-1 where i is a + int, and another of all 1's of length n where n is any + int > m. Then this algorithm will perform approximately (m / 2) * 2 * n * log2(m + 1) = m*n*i operations, or O(n*n*log(n)). So I guess that solves whether or not it is O(n * log(n))! It still should outperform almost any other algorithm when 1) The alphabet over which you are performing the search is reasonably large, 2) The expected overlap between the two strings is reasonably large. Ultimately, I don't really care what the final time is on this algorithm. I am really only looking for any advice on how I can improve it. Thanks,

Jeff

Re: Substring Matching (Harder than it sounds)

Member 419459329-Mar-08 9:55

Member 4194593

29-Mar-08 9:55

Sorry for this late posting, but I have just been reading old forums and ran into this.

If you are still interested in this, I have an algorithm that runs pretty quickly, abet, written in MASM, but the method could be written in C# (I think it could, I have never coded in C#). I have a complete working program - source and EXE and a method to enter test cases via a file with the strings. It is quite lengthy, but mostly commentary. The source with comments is about 2250 lines. The MaxSubstring function is 472 lines with comments. Assembled with MASM 6.15.

Dave Augustine

Need some intelligent people...

Paddy Boyd29-Aug-07 2:20

Paddy Boyd

29-Aug-07 2:20

It's been a long time since i did any really hard maths and a friend of mine just asked me the following question:

"How can I easily find all possible solutions for

3125r = 1024x - 8404

where r and x are both integers, and r is divisible by 5"

Any very intelligent people out there who know of an answer?

Re: Need some intelligent people...

Luc Pattyn29-Aug-07 2:56

Luc Pattyn

29-Aug-07 2:56

Hi,

there is an infinite number of solutions, given by:

r=1020+5120t
x=3121+15625t

where any value of t will do.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Need some intelligent people...

Paddy Boyd29-Aug-07 3:42

Paddy Boyd

29-Aug-07 3:42

I thought there would be any number of solutions (with no particular grounding or proof). But that's certainly a faster way of finding them...

Re: Need some intelligent people...

Paddy Boyd29-Aug-07 3:55

Paddy Boyd

29-Aug-07 3:55

Would it be an impudent question to ask how you arrived at this solution? Trying to work back from your solution leaves me a bit stumped...

Re: Need some intelligent people...

Luc Pattyn29-Aug-07 4:42

Luc Pattyn

29-Aug-07 4:42

Hi Paddy,

there are several ways to find the solution(s) of ax+by=c.

First of all a single, linear equation in two variables has either no solution
at all (say 2x + 4y = 1) or an infinite number, by adding the second coefficient
to the first number and subtracting the first coef from the second number.
So the main problem is to find a first solution.

Method 1: write a little prog, assume x=u+bt and y=v-at and have it try a lot
of values u (keep t at 0) and calculate real v until it happens to be integer.

Method 2: follow the strict mathematical approach; that's what you would need
if decent code is required to solve such problems in general

Method 3: the way in-between, with a couple of shortcuts, the ideal way
to solve a single problem manually. Like so:

3125r = 1024x - 8404 with constraint r=multiple of 5

lets try to replace variables by other variables and reduce the size of the
constants:

right hand side is multiple of 4, hence r must be multiple of 20, say r=20a

3125 * 20a = 1024x - 8404

15625a = 256x - 2101

lets ignore all multiples of the smallest coef, 256

15625=61*256+9
-2101=-9*256+203

9a = 203 (modulo 256)

try a couple of numbers 203 + 256k until a multiple of 9 is found;
this is bound to happen in 9 tries !

one immediately sees a solution is a=(203+256)/9=51
(this is where the mathematician would go more formally !)

hence r=1020, then calculate the x that goes with it, and add one variable t
with the right coefs.

Smile | :)

Luc Pattyn [Forum Guidelines] [My Articles]

Re: Need some intelligent people...

Paddy Boyd29-Aug-07 4:44

Paddy Boyd

29-Aug-07 4:44

Thanks very much for your time Blush | :O

)

Re: Need some intelligent people...

Russell^'29-Aug-07 4:54

Russell^'

29-Aug-07 4:54

Paddy Boyd wrote:
r is divisible by 5

Then use r=5*y, then:

3125*5*y = 1024x - 8404

where y and x are both integers

It is a nice start point

Russell

Re: Need some intelligent people...

cp987629-Aug-07 16:18

cp9876

29-Aug-07 16:18

Just in case Luc's asleep (seems to be never) you could use this applet[^] and solve
-1024x + 15625y + 8404 = 0
then use r = 5y.
If you tick the box 'Step-by-step' it even shows all working!

The source code for this applet is 3742 lines long! (mostly to do with the quadratic problem I guess)

Peter
"Until the invention of the computer, the machine gun was the device that enabled humans to make the most mistakes in the smallest amount of time."

Re: Need some intelligent people...

Luc Pattyn29-Aug-07 23:41

Luc Pattyn

29-Aug-07 23:41

cp9876 wrote:
seems to be never

Really? Gee I must be sleeptyping again. Laugh | :laugh:

Luc Pattyn [Forum Guidelines] [My Articles]

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.