Thursday, January 1, 2009

C# - Counting the number of occurrences of a substring - Take two

In my last post, I stated that I knew of only one way to find a count of all the substrings within a string without looping through the string. Well, a co-worker pointed out to me the other day that I had shown a non-looping way to count all the substrings within a string when I gave a class on Regular Expressions.

That is true. A very simple Regular expression:

new Regex(mySubString).Matches(myString).Count


But, that simple line of code is not the reason for this post. I've found an even deeper issue, and it all comes down to how you define the problem (doesn't it always).

Lets say that my string is "BOBOBOB"

How many instances of the substring "BOB" is in "BOBOBOB"? 2 or 3?

My way in the previous post and the regular expression both agree, the number is 2.

But, say you are writing a game and you are using a string to store previous moves. Each move can have only two choices: a 0 or a 1. So after a number of moves, the storage string could look like this "0111110010100111110110111101".

Now, how many times did the player choose to play a 1 after playing a 1 the prior two moves, i.e. a "111"? If you did a substring search, you would get a 3. But if you count the number of times "111" appear including overlap, the number is much higher: 8.

So, an AI routine that was based on prior moves would need to find all combinations of "111" even if one set is partially contained within another.

So, here is my LOOP that counts the number of substrings within a string, even if the substring is partially contained within another substring.

for (var i = 0; i <= myString.Length - mySubString.Length; i++)
{
if (myString.Substring(i, mySubString.Length) == mySubString)
{
count++;
}
}

11 comments:

Unknown said...

Hello ,

it's should be vice versa in the way we use it (checked right now) .

The correct way is :

new Regex(mySubString).Matches(myString).Count;

For example , to know the number of occurrences of blank space " " in the string str , we write :

new Regex(" ").Matches(str).Count;


Lior .

Lee Saunders said...

Oooopppppsssss!

This is sad ... for two reasons.

1 - I should have verified my code before I posted - I usually do, I do not know why I didn't this time, but that is no excuse.

2 - I't has taken this long for someone to notice.

Thanks again, I've corrected the main text.

Unknown said...

However , using Regex is not recommended for dealing with large amount of text .

I've just written an algorithm which uses Regex for counting the number of occurrences of a given string within another one ;
the algorithm runs like a turtle ...

Replacing the Regex use with a "naive" method , which scans the string from the beginning ,
improved the actual running time of the whole algorithm dramatically .


Lior

Lee Saunders said...

True, its always a balance between using the generic tools in the Language/Framework and "rolling your own".

One works with very little coding but may be so generic or built with so many options that speed is affected for large data sets.

On the otherhand if you only have a little 2k string that you need to "look at", custom code could be a big time waster.

Anonymous said...

Thanks a lot.
It's help me lot

Anonymous said...

there is no need to loop, just split the string using the search substring as the separator, and return the resulting array length

Lee Saunders said...

So, for simple counting of occurrences of a substring we are now up to 3 different ways (See previous post - link at top of page).

string[] s = {"BOB"};

Console.Write("BOB".Split(s,StringSplitOptions.None).Count() - 1);

Of course you need to subtrack 1 because both Length and Count give you the total number of elements in the array and if you split an empty string, you will get a count of 1.

But, the purpose of this post was to find all substrings even if the substring is partially contained within another substring.

Your way produces the same number for strings BOBBOB and BOBOBOB: 2, where it should return 3 for the second case.

Unknown said...

Very, very good. Thanks for sharing!

Anonymous said...

In this example Regex is rather slow. LINQ is also one of the slowest ways as well. One of the fastest ways to do it without looping is the following:

//So if key = "02", and strToSearch = "01,02,03", the number of occurrences should be (8-6) / 2 = 1

int substr_count(String strToSearch, String strKeyToLookFor)
{
return (strToSearch.Length - strToSearch.Replace(strKeyToLookFor, String.Empty).Length) / strKeyToLookFor.Length;
}


Performance benchmarks can also be seen on this post:
http://stackoverflow.com/questions/541954/how-would-you-count-occurences-of-a-string-within-a-string-c

PS: anyone who has done PHP programming will appreciate the method name. :-)

Lee Saunders said...

Wow, anonymous. I'm glad you read my blog ... wait, you didn't.

This was a take-two as the title clearly states. If you had taken the time to read the original post, it would have shown that I posted about this method, oh, 3 and a half years ago!?!

Thanks for your comment.

FireMystDL said...

If you're looking for the fastest way, check out this blog:
http://blogs.davelozinski.com/curiousconsultant/csharp-net-fastest-way-to-check-if-a-string-occurs-within-a-string

as it benchmarks about a dozen different ways, including parallel loops.