18 December 2009

Regex mystery

One little bit of the app that I've been working on for the past several months is an HTML text box where the editor/producer can enter a relative URL that identifies an image file. In practice, though, the editor often has an absolute URL to paste into the text box (perhaps grabbed by context-clicking elsewhere on the web site), so one of the processing steps done in JavaScript is to remove the scheme and server name from the URL. My client has multiple media servers within its client.org domain. So I wrote this tiny function, which in essence is nothing but this:



// Optional scheme, then any number of host labels, then the client.org domain.
var URL_REGEX = /(http:\/\/)?(\w+\.{0,1})*client\.org/i;

function stripUrlPrefix (url) {
    return url.replace(URL_REGEX, '');
}



stripUrlPrefix() removes the scheme, if present, and the server name, and it usually works like a champ.
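For illustration, here is how the function behaves on a few well-behaved inputs. The host names media1 and media2 and the file path are made up for this example; only the client.org domain comes from the description above.

```javascript
var URL_REGEX = /(http:\/\/)?(\w+\.{0,1})*client\.org/i;

function stripUrlPrefix (url) {
    return url.replace(URL_REGEX, '');
}

// Hypothetical sample inputs, not real paths from the client's servers.
console.log(stripUrlPrefix('http://media1.client.org/images/photo.jpg')); // /images/photo.jpg
console.log(stripUrlPrefix('media2.client.org/images/photo.jpg'));        // /images/photo.jpg
console.log(stripUrlPrefix('/images/photo.jpg'));                         // /images/photo.jpg
```

A URL that is already relative contains no client.org, so replace() finds nothing to strip and returns the string unchanged.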

However, Tony on the testing team found that the following input string (a real path name from one of our servers) sends the regex engines in IE 7 and Firefox 3 completely out to lunch:



/images/ap//AP_News_Wire:_World_News/3_Australia_Thirsty_Camels.sff_300.jpg



On my middle-of-the-line Windows XP laptop, IE 7 takes about 10 minutes to execute stripUrlPrefix(), given this input string; Firefox just pegs the CPU and never does return. Jason is going to give this code a spin on Chrome to see what happens.

I have somehow stumbled into some kind of backtracking morass with a regex that looks pretty vanilla to me, and an input string that's likewise not too gnarly.
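A likely explanation, sketched here rather than verified against either engine: the group (\w+\.{0,1})* nests one quantifier inside another, and because the dot is optional, a run of word characters (underscores count as \w) can be divided among repetitions of the group in many different ways. When client.org never appears in the input, a backtracking engine may try every one of those divisions before giving up, and the unanchored pattern repeats the work from each starting position. The helper countSplits below is hypothetical; it only counts the divisions for a single run and does not model the engine itself.

```javascript
// Counts the ways a run of n word characters can be divided among
// repetitions of (\w+\.{0,1})*. Each repetition consumes between 1 and
// all of the remaining characters, so the count doubles with each
// extra character.
function countSplits(n, memo) {
    memo = memo || {};
    if (n === 0) {
        return 1;
    }
    if (memo[n] !== undefined) {
        return memo[n];
    }
    var total = 0;
    for (var k = 1; k <= n; k++) {
        total += countSplits(n - k, memo);
    }
    memo[n] = total;
    return total;
}

// The longest unbroken run of word characters in the problem path.
var run = '3_Australia_Thirsty_Camels';
console.log(run.length);              // 26
console.log(countSplits(run.length)); // 33554432, i.e. 2^25
```

That is over 33 million divisions for this one run, and the segments after each dot in the file name multiply the total further.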

It turns out that we can fix the problem by trimming leading whitespace from the input and adding a beginning-of-string anchor to the regular expression, thus:



var URL_REGEX = /^(http:\/\/)?(\w+\.{0,1})*client\.org/i;
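The anchor helps with this input because the path begins with a slash, so the match fails right at the first character. A relative path that starts with a long run of word characters could, in principle, still send the anchored pattern into heavy backtracking. One hedged alternative is to make the dot after each host label mandatory, so a run of word characters can only be divided one way. The names STRICT_URL_REGEX and stripUrlPrefixStrict, and the host media1, are assumptions for this sketch, and the stricter pattern no longer matches a host name glued directly onto client.org without a dot.

```javascript
// Hypothetical variant: the dot after each host label is required,
// which removes the ambiguity inside the repeated group.
var STRICT_URL_REGEX = /^(http:\/\/)?(\w+\.)*client\.org/i;

function stripUrlPrefixStrict (url) {
    // Trim leading whitespace first so the ^ anchor lines up with the URL.
    var trimmed = url.replace(/^\s+/, '');
    return trimmed.replace(STRICT_URL_REGEX, '');
}

console.log(stripUrlPrefixStrict('  http://media1.client.org/images/foo.jpg')); // /images/foo.jpg

// The path from the bug report comes back unchanged, and promptly.
var problemPath = '/images/ap//AP_News_Wire:_World_News/3_Australia_Thirsty_Camels.sff_300.jpg';
console.log(stripUrlPrefixStrict(problemPath) === problemPath); // true
```

The trimming step is done with a regex replace rather than a built-in trim method, on the assumption that the code has to run in browsers as old as IE 7.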



I haven't checked to see whether explicitly using the RegExp class would make a difference.
