Monday, April 6, 2009

The "Weekly" Newsgroup post #2

This is my second "Weekly" (Oh, that's a laugh) Newsgroup post.

In this post, we have what seems like a simple question: How to ensure that at most there is one occurrence of each character from a list of characters.

Or, to be more specific, Byron gave this example:

For example, a string [starting with a capital 'D' and a dash - ed] can have up to 4 unique digits between 1 and 4.

The following strings should pass:
D-1
D-14
D-1234
D-41

The following should fail:
D- (no digit)
D-0 (out of range)
D-5 (out of range)
D-11 (repeated digit)
D-1 2 (embedded space)
D-12345 (out of range and too long)

The first reply was by Peter Duniho, who instead of trying to answer the question suggested that Bryon try using C# instead. Many other replies sought to answer the question with brute force, the same way that C# would have to.

Now, I did not select this post because I contributed to it, but I did suggest using lookaheads and a true RegEx guru came up with the an amazingly elegant solution.

Which Peter immediately to exception to. Now, I would be the first to say that Peter is an amazing C# programmer and he leaves me in the dust, but it is BECAUSE he is such a great C# programmer that he seems to look at everything through C# goggles.

His argument boils down to Regular Expressions are hard and some people do not understand them, so do not use them.

Well, I agree that if you do not understand Regular Expressions, you should not use them, but the same holds true to C#, if you do not understand a language you should not be programming in it.

But, Peter points out, look how long it took this group to find an elegant solution. That definitely shows how hard Regular Expressions are. Well, this is a C# group not a Regular Expression group so it makes sense that it could take us awhile. I bet if we asked a non-trivial question on a Regular Expression group it would take quite a few iterations till they came up with an elegant C# solution.

Why did I not point this out to Peter in the post? I've found that you never really win an argument with him ... it is just best to smile, nod your head slowly and back out of the discussion.

So, with Peter's straw man firmly destroyed, lets take a look at this "elegant" Regular Expression by Jesse Houwing:

^D-(?:([1234])(?!.*\1)){1,4}$

The beginning is fairly straight forward, the string must start with a "D-"

Then we see, at the innermost set of parenthesis is ([1234]). This is a capture group that simply means, the next character I find must be in the set of 1 through 4 and then save it as a capture. Since the capture is not named, it can be referred to by its number. Since it is the first capture, it is #1, or \1, in RegEx speak.

Next we see another set of parenthesis, but the opening paren is immediately followed by "?!". This turns the RegEx code inside the parenthesis into what is called a Negative Lookahead. What it does is, it scans the rest of the string and if a match is found for the lookahead, the entire Regular Expression fails.

The RegEx code for the Lookahead is: .*\1

This simply says can I find a match to the character I found in the capture with any number of characters before it? Or, i.e. does the character ever appear again?

All of this is surrounded by a set of parens with the opening paren immediately followed by "?:". This says that we do not want to capture, i.e. remember the strings that RegEx would normally keep as a sub-group.

last but not least, "{1,4}$" informs the RegEx engine that the string must end with 1 to 4 matches from the parens that we said we did not want to keep track of for later.

This amazing little bit of code will fulfill all the requirements the original requester asked for and teach a bit as well!

No comments: