25 November 2008

I detect a pattern here

I've had Coding Horror's post on regular expressions bookmarked for a while now, just waiting for the chance to take a few minutes to type "Right on!" For certain validation problems, a regex is the only way to go. At Vovici, I used them with the RegularExpressionValidator control to ensure that a text box was, say, filled in with a valid e-mail address or with a URL from a particular domain. And about once a quarter my colleague Cap would IM me with a request for a quick regex consult.

You can also use a regex to make sure that a text box is filled in with a valid date (in MM-DD-YY format, for instance), but in this case you're usually better off using a specialized date picker, such as one that presents a pop-up monthly calendar so that all the user has to do is click a number.
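
One reason the regex is the weaker tool here: a pattern can check that the text has the MM-DD-YY shape, but not that the date exists on the calendar. Here is a minimal sketch in Python (chosen only for illustration; the function names are made up for this example) that pairs a shape check with the standard library's date parsing:

```python
import re
from datetime import datetime

# Shape check only: two digits, hyphen, two digits, hyphen, two digits.
DATE_SHAPE = re.compile(r"\d{2}-\d{2}-\d{2}")


def looks_like_date(text: str) -> bool:
    """True if text has the MM-DD-YY shape; says nothing about the calendar."""
    return DATE_SHAPE.fullmatch(text) is not None


def is_real_date(text: str) -> bool:
    """True only if text has the right shape and names an actual date."""
    if not looks_like_date(text):
        return False
    try:
        datetime.strptime(text, "%m-%d-%y")
    except ValueError:
        # Right shape, impossible date (February 30, month 13, and so on).
        return False
    return True
```

With these definitions, "02-30-08" passes looks_like_date but fails is_real_date, which is exactly the kind of input a date picker never lets the user produce in the first place.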

The big problem with regular expressions is the proliferation of implementations and all the bells and whistles that come with them. For example, we found a particularly useful pattern at RegExLib.com to match e-mail addresses that include the display-name part (as in "User, Joe" <joe.user@example.com>), but the pattern wasn't useful for client-side validation because it used features that depended on a browser-specific regex engine. So a reference book like Jeffrey Friedl's Mastering Regular Expressions is really handy to help you keep track of platform specifics. By all means, use the contributed patterns from a site like RegExLib.com, but don't put a pattern into production that you don't understand yourself.

Another tool that you may find useful is Ivaylo Badinov's test harness for regular expressions, REGex TESTER.

Just to amplify a couple of Jeff Atwood's points:

Do not try to do everything in one uber-regex. I know you can do it that way, but you're not going to. It's not worth it. Break the operation down into several smaller, more understandable regular expressions, and apply each in turn. Nobody will be able to understand or debug that monster 20-line regex, but they might just have a fighting chance at understanding and debugging five mini regexes.

This is good advice for smaller patterns, too. If you're trying to recognize U.S. telephone numbers, for instance, start with a pattern that recognizes area codes (something like /\d{3}/) and one that recognizes exchange and number body (/\d{3}-\d{4}/) and then put the two patterns together (into /(\d{3}-)?\d{3}-\d{4}/).
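
The same build-it-from-pieces idea can be written down in code. This is a sketch in Python, not the code we shipped; the constant and function names are invented for the example, and the optional group is written as non-capturing, which matches the same strings as the combined pattern above:

```python
import re

# Each piece is small enough to test and understand on its own.
AREA_CODE = r"\d{3}"
EXCHANGE_AND_NUMBER = r"\d{3}-\d{4}"

# Put the pieces together: an optional area code plus hyphen, then the rest.
PHONE = re.compile(rf"(?:{AREA_CODE}-)?{EXCHANGE_AND_NUMBER}")


def is_us_phone(text: str) -> bool:
    """True if the whole string is NNN-NNNN or NNN-NNN-NNNN."""
    return PHONE.fullmatch(text) is not None
```

Using fullmatch rather than search keeps a stray extra digit at either end from slipping through. If the area-code rule ever changes, only AREA_CODE needs to be edited.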

Regular expressions are not Parsers. Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack—and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.

Exactly. Regular expressions are good for problems that call for a bounded degree of nesting: breaking up a file of XML into tokens that represent the element and attribute names, for instance. These problems are what the language translation people would call lexical analysis. For problems that permit arbitrarily deep nesting, like parsing the stream of XML tokens into a document tree, ensuring that each tag is properly closed and nested, you're doing syntactic analysis, and you need a tool like yacc.
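
To make the division of labor concrete, here is a rough sketch in Python. The regex does the lexical analysis, pulling tag tokens out of the text, and a plain stack stands in for the real parser to check nesting. It is deliberately simplified and rests on assumptions: it ignores comments, CDATA sections, and attribute values that contain angle brackets, and the names tokenize and is_well_nested are made up for this example. A production job would use a real XML parser or a generated one.

```python
import re

# Lexical analysis: one regex recognizes a single tag token.
# Groups: optional leading "/", the element name, optional trailing "/".
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")


def tokenize(xml: str):
    """Yield (kind, name) pairs, where kind is 'open', 'close', or 'empty'."""
    for match in TAG.finditer(xml):
        closing, name, self_closing = match.groups()
        if closing:
            yield ("close", name)
        elif self_closing:
            yield ("empty", name)
        else:
            yield ("open", name)


def is_well_nested(xml: str) -> bool:
    """Syntactic analysis: a stack tracks arbitrarily deep nesting."""
    stack = []
    for kind, name in tokenize(xml):
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            # A close tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack
```

The stack is the part that no fixed regex can replace: it can grow as deep as the document does, while the pattern only ever looks at one tag at a time.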


Anonymous said...

One more thought on the RegularExpressionValidator: you will want to code your pattern once as a const string that you can get to from all your code; assign the string to the validator's ValidationExpression property in your code-behind file.

Anonymous said...

And, perhaps not so coincidentally, regular expressions appeared in the technical screening exercise that I took today.