25 November 2008

I detect a pattern here

I've had Coding Horror's post on regular expressions bookmarked for a while now, just waiting for the chance to take a few minutes to type "Right on!" For certain validation problems, a regex is the only way to go. At Vovici, I used them with the RegularExpressionValidator control to ensure that a text box was, say, filled in with a valid e-mail address or with a URL from a particular domain. And about once a quarter my colleague Cap would IM me with a request for a quick regex consult.
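As a rough illustration of the "URL from a particular domain" kind of check, here is a minimal Python sketch. The domain example.com, the pattern, and the function name are all placeholders made up for this example; this is not the pattern we used at Vovici, and a production pattern would need to deal with ports, user info, and internationalized host names.

```python
import re

# Hypothetical pattern: http or https URLs on example.com or one of its
# subdomains, optionally followed by a path. Illustration only.
DOMAIN_URL = re.compile(r"https?://(?:[A-Za-z0-9-]+\.)*example\.com(?:/\S*)?")

def is_example_url(text: str) -> bool:
    """Return True when the whole string is a URL on the assumed domain."""
    return DOMAIN_URL.fullmatch(text) is not None
```

Matching the whole string, rather than searching inside it, is what keeps a look-alike host such as example.com.evil.org from slipping through.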

You can also use a regex to make sure that a text box is filled in with a valid date (in MM-DD-YY format, for instance), but in this case you're usually better off using a specialized date picker: one that presents a pop-up monthly calendar, say, so that all the user has to do is click a number.
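To see why a pattern alone falls short for dates, consider this Python sketch. The pattern and both function names are assumptions made up for the illustration. The regex can check the shape of an MM-DD-YY string, but it happily accepts February 31; something that knows the calendar has to make the final call.

```python
import re
from datetime import datetime

# Hypothetical shape check for MM-DD-YY: month 01-12, day 01-31, two-digit year.
DATE_SHAPE = re.compile(r"(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])-\d{2}")

def looks_like_date(text: str) -> bool:
    """Regex only: right shape, but no knowledge of month lengths."""
    return DATE_SHAPE.fullmatch(text) is not None

def is_real_date(text: str) -> bool:
    """Shape check first, then let the datetime module check the calendar."""
    if not looks_like_date(text):
        return False
    try:
        datetime.strptime(text, "%m-%d-%y")
    except ValueError:
        return False
    return True
```

With these definitions, looks_like_date("02-31-08") is true while is_real_date("02-31-08") is false, which is the gap that a date picker closes for you.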

The big problem with regular expressions is the proliferation of implementations and all the bells and whistles that come with them. For example, we found a particularly useful pattern at RegExLib.com to match e-mail addresses that include the display-name part (as in "User, Joe" <joe.user@example.com>), but the pattern wasn't useful for client-side validation because it used features that depended on a browser-specific regex engine. So a reference book like Jeffrey Friedl's Mastering Regular Expressions is really handy to help you keep track of platform specifics. By all means, use the contributed patterns from a site like RegExLib.com, but don't put a pattern into production that you don't understand yourself.

Another tool that you may find useful is Ivaylo Badinov's test harness for regular expressions, REGex TESTER.

Just to amplify a couple of Jeff Atwood's points:

Do not try to do everything in one uber-regex. I know you can do it that way, but you're not going to. It's not worth it. Break the operation down into several smaller, more understandable regular expressions, and apply each in turn. Nobody will be able to understand or debug that monster 20-line regex, but they might just have a fighting chance at understanding and debugging five mini regexes.

This is good advice for smaller patterns, too. If you're trying to recognize U.S. telephone numbers, for instance, start with a pattern that recognizes area codes (something like /\d{3}/) and one that recognizes exchange and number body (/\d{3}-\d{4}/) and then put the two patterns together (into /(\d{3}-)?\d{3}-\d{4}/).
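Here is one way the build-it-up approach might look in Python. The constant and function names are invented for the sketch, and I've used a non-capturing group, (?:...), for the optional area code, where the pattern above uses a plain group.

```python
import re

# The two small pieces, each easy to test on its own.
AREA_CODE = r"\d{3}"
LOCAL_NUMBER = r"\d{3}-\d{4}"

# The combined pattern: an optional area code and hyphen, then the local number.
PHONE = re.compile(rf"(?:{AREA_CODE}-)?{LOCAL_NUMBER}")

def is_phone(text: str) -> bool:
    """Return True when the whole string is a phone number in this format."""
    return PHONE.fullmatch(text) is not None
```

Both "555-1234" and "512-555-1234" pass, while "512-5551234" does not. If the area-code rules change, there is only one small piece to fix.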

Regular expressions are not Parsers. Although you can do some amazing things with regular expressions, they are weak at balanced tag matching. Some regex variants have balanced matching, but it is clearly a hack—and a nasty one. You can often make it kinda-sorta work, as I have in the sanitize routine. But no matter how clever your regex, don't delude yourself: it is in no way, shape or form a substitute for a real live parser.

Exactly. Regular expressions are good for problems that call for a bounded degree of nesting: breaking up a file of XML into tokens that represent the element and attribute names, for instance. These problems are what the language translation people would call lexical analysis. For problems that permit arbitrarily deep nesting, like parsing the stream of XML tokens into a document tree, ensuring that each tag is properly closed and nested, you're doing syntactic analysis, and you need a tool like yacc.
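A small Python sketch of the division of labor, under heavy simplifying assumptions: the tag pattern ignores attributes, comments, and CDATA, and every name here is invented for the example. The regex does the lexical work of finding tags; a stack, standing in for a real parser or a yacc-generated one, does the syntactic work of checking that the tags nest.

```python
import re

# Lexical level: a deliberately simplified tag pattern (no attributes).
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)\s*(/?)>")

def tag_tokens(text):
    """Yield (kind, name) pairs, where kind is 'open', 'close', or 'empty'."""
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if closing:
            yield ("close", name)
        elif self_closing:
            yield ("empty", name)
        else:
            yield ("open", name)

def is_well_nested(text):
    """Syntactic level: a stack tracks open tags to any depth."""
    stack = []
    for kind, name in tag_tokens(text):
        if kind == "open":
            stack.append(name)
        elif kind == "close":
            # A close tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
    return not stack
```

The stack is the part that no fixed regex can replace: it can grow as deep as the document does.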

18 November 2008

Fan belt?

The Australian Computer Museum Society has agreed to lend a 1960s-era IBM tape drive to the cause of recovering data on lunar dust that was collected on the Apollo XI, XII, and XIV missions, reports Nic MacBean. The 7-track IBM 729 Mark V drive is described as "in need of tender love and care."

(Link via Risks Digest.)

11 November 2008

A freebie

IEEE/Computer Society has announced a new benefit to members, to be available in December: free access to 600 titles in the O'Reilly Safari library. Depending on what is made available, this could save me some slots in my current 10-slot subscription.

And what do I get for the $15/month that I'm paying? Right now, I have these volumes on my virtual bookshelf:

  • Duthie and MacDonald, ASP.NET in a Nutshell, 2/e
  • Meyer, CSS: The Definitive Guide, 3/e
  • Bergsten, JavaServer Pages, 3/e: I checked this out for a specific project, and will release it soon
  • Pogue, Mac OS X Leopard: The Missing Manual
  • Friedl, Mastering Regular Expressions, 3/e
  • Snell et al., MCPD Self-Paced Training Kit (Exam 70-547): this goes back once I pass the exam
  • Northrup et al., MCTS Self-Paced Training Kit (Exam 70-536): ditto
  • Pogue et al., Windows XP Pro Edition: The Missing Manual, 2/e

I can swap out anything after holding it for 30 days. Most important, I can change up to the next edition of a book without having to pulp the old one.

05 November 2008

Exam prep: 7

Well, there's been lots of excitement these past few weeks, in and out of the office, but I'm trying to keep my feet moving in my studies for the MCTS exam 70-536. This week I started chapter 8 of the standard prep guide: application domains and services.