Regex for GSM 03.38 7bit character set

If your application sends SMS messages, you want to make sure that the characters to be sent comply with GSM 03.38 (i.e. the 7-bit default alphabet).

If you need to validate (user) input for invalid characters, you can do that pretty easily with regular expressions. For Java, the following regex is ready to use. For other languages, you can base your implementation on the “pure” unescaped regex shown in the first line of the code comment.
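As a language-neutral illustration, here is a minimal Python sketch of such a validator. The character tables are a hand transcription of the GSM 03.38 default alphabet and its extension table, so treat them as an assumption and check them against the spec before relying on them. The names `GSM_BASIC`, `GSM_EXTENSION`, `is_gsm_7bit` and `invalid_chars` are made up for this example.

```python
import re

# GSM 03.38 default alphabet, transcribed by hand (assumption: verify against the spec).
# ESC (0x1B) is left out because it only introduces extension-table characters.
# Greek capitals are written as Unicode escapes so the source stays unambiguous.
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅå"
    "\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)

# Extension table: form feed, ^ { } \ [ ~ ] | and the euro sign.
GSM_EXTENSION = "\f^{}\\[~]|\u20AC"

# re.escape takes care of characters that are special inside a character class.
GSM_PATTERN = re.compile("[" + re.escape(GSM_BASIC + GSM_EXTENSION) + "]*")


def is_gsm_7bit(text):
    """Return True if every character of text is in the assumed GSM 03.38 set."""
    return GSM_PATTERN.fullmatch(text) is not None


def invalid_chars(text):
    """Return the characters of text that fall outside the assumed GSM 03.38 set."""
    allowed = set(GSM_BASIC + GSM_EXTENSION)
    return [ch for ch in text if ch not in allowed]
```

Building the character class from a plain string and `re.escape` avoids hand-escaping `]`, `\`, `^` and `-`, which is where hand-written versions of this regex tend to go wrong. Note that extension-table characters are valid but take up two septets each in the message.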

Gee, who allowed the Greeks to smuggle part of their alphabet into GSM 03.38? Since those characters don’t fit into Latin-1 (ISO-8859-1), they need to be written as Unicode escapes in the regex (or the source file has to be saved in a Unicode encoding such as UTF-8). More on that in this excellent regex tutorial. Oh yes, and I do recommend using RegexBuddy; it really is my regex life-saver.

Oracle regular expression aka regexp, issues with character class

In most programming languages the regular expression pattern to find the digit ‘1’ surrounded by ‘;’ and other digits would be something like

[;\d]*1[;\d]*

So, the pseudo character class “; or digit” is matched zero or more times, then the digit 1 is matched followed by zero or more “; or digit”s. A few examples:
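To make this concrete, here is a small Python sketch, with Python standing in for “most programming languages”. It assumes the pattern described above is `[;\d]*1[;\d]*`; the sample strings and the helper name `find_run` are made up for illustration.

```python
import re

# Assumed pattern: "; or digit" zero or more times, the digit 1,
# then "; or digit" zero or more times again.
PATTERN = re.compile(r"[;\d]*1[;\d]*")


def find_run(text):
    """Return the first run of digits/semicolons that contains a 1, or None."""
    match = PATTERN.search(text)
    return match.group() if match else None


print(find_run("abc;12;3def"))  # ;12;3
print(find_run("7;8;1"))        # 7;8;1
print(find_run("x1y"))          # 1
print(find_run("22;33"))        # None
```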
With Oracle SQL, however, it’s a slightly different story. \d is not supported, i.e. not properly recognized as the character class for digits. However, the range 0-9, which is generally equivalent to \d, seems to be supported. In Oracle you could therefore use

[;0-9]*1[;0-9]*

As far as I can tell this is an undocumented feature. The official Oracle regexp documentation only mentions that it supports the regular POSIX character class [:digit:]. Watch out: the equivalent of \d is the whole expression [:digit:], not just :digit:. I was first fooled by the extra [] around the character class designator… So, according to the documentation you’d have to use

[;[:digit:]]*1[;[:digit:]]*
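The claim that 0-9 is “generally” equivalent to \d can be checked outside Oracle. The sketch below uses Python only as a stand-in, so it says nothing about Oracle’s own engine, and Python’s `re` module has no POSIX classes such as [:digit:]. The sample strings and the helper name `first_match` are assumptions for illustration.

```python
import re

shorthand = re.compile(r"[;\d]*1[;\d]*")  # \d shorthand
ranged = re.compile(r"[;0-9]*1[;0-9]*")   # explicit 0-9 range


def first_match(pattern, text):
    """Return the first match of a compiled pattern in text, or None."""
    match = pattern.search(text)
    return match.group() if match else None


# On plain ASCII input both spellings find the same run.
ascii_sample = "id=4;31;7!"
same_on_ascii = first_match(shorthand, ascii_sample) == first_match(ranged, ascii_sample)
print(same_on_ascii)  # True

# In Python 3, \d also matches non-ASCII decimal digits such as U+0663,
# so the two spellings can differ unless re.ASCII is passed.
mixed_sample = "\u0663" + "1"
same_on_mixed = first_match(shorthand, mixed_sample) == first_match(ranged, mixed_sample)
print(same_on_mixed)  # False
```

So the explicit range is the more predictable spelling whenever only the ASCII digits 0 to 9 are meant.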

Regex: match last occurrence

Today, I found myself looking for a regular expression that matches only the last occurrence of a given expression. As I’m still not a regex mastermind I couldn’t come up with it just like that.

The key to the solution is a so-called “negative lookahead”. A lookahead doesn’t consume characters in the string; it only asserts whether a match is possible or not. So if you wanted to extract the last “foo” in the text “foo bar foo bar foo”, your regex would look like this:

foo(?!.*foo)
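A short Python sketch shows the pattern foo(?!.*foo) at work on that text. The helper `last_match` is hypothetical and not part of any library; it simply wraps an arbitrary sub-pattern in the same negative lookahead, and it is only meant for simple patterns without backreferences.

```python
import re

LAST_FOO = re.compile(r"foo(?!.*foo)")

text = "foo bar foo bar foo"
match = LAST_FOO.search(text)
print(match.start())  # 16, i.e. the third "foo"


def last_match(pattern, text):
    """Hypothetical helper: return the last match of pattern in text, or None."""
    # Match the pattern only where it cannot be matched again further right.
    wrapped = "(?:{0})(?!.*(?:{0}))".format(pattern)
    found = re.search(wrapped, text)
    return found.group() if found else None


print(last_match(r"\d+", "a1 b22 c333"))  # 333
```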


If you use the DOTALL option, which lets the dot match line breaks as well, the above expression also works correctly on multi-line text.
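As an illustration with a made-up three-line text, the following Python sketch compares the expression with and without `re.DOTALL`, which is Python’s name for the DOTALL option.

```python
import re

text = "foo bar\nfoo bar\nfoo"

# Without DOTALL the dot stops at line breaks, so the lookahead cannot see
# the later lines and the very first "foo" already passes the check.
without_dotall = re.search(r"foo(?!.*foo)", text).start()
print(without_dotall)  # 0

# With DOTALL the dot also matches "\n", so only the final "foo" qualifies.
with_dotall = re.search(r"foo(?!.*foo)", text, re.DOTALL).start()
print(with_dotall)  # 16
```

Without the option the expression finds the last “foo” of the first line that contains one, not the last one in the whole text.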


Of course the example is not taken from a real-life scenario, as it doesn’t matter which “foo” is matched when they’re all the same anyway. A real expression would no doubt be more complicated, but I hope you get the point.


Someone asked for an explanation… Here’s what RegexBuddy, my indispensable regex tool, produces automatically:
# foo(?!.*foo)
# Match the characters “foo” literally «foo»
# Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*foo)»
#    Match any single character that is not a line break character «.*»
#       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
#    Match the characters “foo” literally «foo»