28 September 2009

All that is the case

I'm working on a short piece of code to capitalize the initial letter of every word in a given sentence (headlines for news stories from a wire service, to be specific), in order to match house style. At first, I was a little surprised that I couldn't find a Java class to do most of the work for me. But once I waded into the actual sentences that I had to process, with their variations and exceptions, I came to the conclusion that some degree of rolling my own was called for. Here's my current draft:



import java.util.regex.*;

* * *

private String capitalizeAllInitials(String text)
{
    //capitalize the first letter of every word in the passed text
    //(including little words like "a," "the," "and," "to")
    //also capitalize words in quoted and hyphenated phrases

    //NOTES: The pattern requires a leading space or hyphen; I have found
    //that ingested stories already capitalize the first word of the title.

    //group 1 is the prefix (hyphen, or space plus optional quote mark);
    //group 4 is the lowercase letter to capitalize
    Pattern p = Pattern.compile("(-|( (`|\\\"|\\\')?))([a-z])");
    Matcher m = p.matcher(text);

    StringBuffer sb = new StringBuffer();
    while (m.find())
    {
        m.appendReplacement(sb, m.group(1) + m.group(4).toUpperCase());
    }
    m.appendTail(sb);

    return sb.toString();
}



As the comments note, this code handles ordinary words (The quick brown fox jumps over the lazy dog becomes The Quick Brown Fox Jumps Over The Lazy Dog), hyphenated phrases (Senator proposes pay-as-you-go plan becomes Senator Proposes Pay-As-You-Go Plan), and quoted phrases (Accused confesses, "we did it" becomes Accused Confesses, "We Did It") with single quotes, double quotes, or backquotes. The code assumes that the first word is already capitalized; if that's not the case for your input, you would need to add ^ to the regular expression as another alternative alongside the hyphen and the space.
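The ^ variant mentioned above might look something like the following. This is only a sketch, not the code I'm running: the class name InitialCaps is made up for illustration, the pattern is flattened to two groups with a character class for the quote marks, and I've added Matcher.quoteReplacement and Locale.ROOT as precautions that the draft doesn't strictly need.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, for illustration only.
class InitialCaps {
    // Group 1: start of text, a hyphen, or a space plus an optional quote mark.
    // Group 2: the lowercase ASCII letter to capitalize.
    private static final Pattern INITIAL =
        Pattern.compile("(^|-| [`\"']?)([a-z])");

    static String capitalizeAllInitials(String text) {
        Matcher m = INITIAL.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1) + m.group(2).toUpperCase(Locale.ROOT);
            // quoteReplacement keeps '$' and '\' from being read as group references
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: The Quick Brown Fox Jumps Over The Lazy Dog
        System.out.println(capitalizeAllInitials("the quick brown fox jumps over the lazy dog"));
        // prints: Senator Proposes Pay-As-You-Go Plan
        System.out.println(capitalizeAllInitials("senator proposes pay-as-you-go plan"));
    }
}
```

Note that ^ by itself still won't catch a headline that opens with a quote mark; that case would need its own alternative.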

This code doesn't handle the common forms of title casing, in which articles, prepositions, and other small words are left in lower case. It also capitalizes proper names and trademarks indiscriminately (eBay becomes EBay, for instance).
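For comparison, here is a rough sketch of what a small-word exception might look like. Everything in it is an assumption for illustration: the class name TitleCaseSketch, the word list (a real house style would supply its own), and the rule that the first and last words are always capitalized. It splits on single spaces and ignores the hyphen and quote handling above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical class, for illustration only.
class TitleCaseSketch {
    // Assumed exception list; not taken from any particular style guide.
    private static final Set<String> SMALL_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "at", "but", "by", "for", "in", "of", "on", "or", "the", "to"));

    static String titleCase(String text) {
        // limit of -1 keeps trailing empty strings, so trailing spaces survive
        String[] words = text.split(" ", -1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < words.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            String w = words[i];
            boolean firstOrLast = (i == 0) || (i == words.length - 1);
            if (w.isEmpty() || (!firstOrLast && SMALL_WORDS.contains(w))) {
                sb.append(w);   // leave small words (and empty tokens) alone
            } else {
                sb.append(w.substring(0, 1).toUpperCase(Locale.ROOT))
                  .append(w.substring(1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: The Quick Brown Fox Jumps Over the Lazy Dog
        System.out.println(titleCase("the quick brown fox jumps over the lazy dog"));
    }
}
```

This does nothing for proper names and trademarks, which would probably need a lookup list of their own.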

1 comment:

David Gorsline said...

It's probably also worth noting that this code only works for 7-bit ASCII letters--no accents, no Unicode.
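A rough sketch of what a Unicode-aware variant might look like, assuming that swapping [a-z] for the \p{Ll} category (any lowercase letter) is all that's wanted. The class name UnicodeInitialCaps is made up for illustration, and the pattern is flattened to two groups. Like the draft, it still expects a leading space or hyphen.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, for illustration only.
class UnicodeInitialCaps {
    // Group 1: a hyphen, or a space plus an optional quote mark.
    // Group 2: any Unicode lowercase letter (category Ll), not just a-z.
    private static final Pattern INITIAL =
        Pattern.compile("(-| [`\"']?)(\\p{Ll})");

    static String capitalizeAllInitials(String text) {
        Matcher m = INITIAL.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Locale.ROOT avoids locale-specific surprises such as the Turkish dotted I
            String replacement = m.group(1) + m.group(2).toUpperCase(Locale.ROOT);
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is a lowercase e with an acute accent
        System.out.println(capitalizeAllInitials("Le caf\u00e9 \u00e9l\u00e9gant"));
    }
}
```

One wrinkle: a few letters upper-case to more than one character (the German sharp s becomes SS), so the result can be longer than the input.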