Regex for GSM 03.38 7-bit character set

If your application sends SMS messages, you want to make sure that the characters to be sent comply with the GSM 03.38 default alphabet (i.e. the 7-bit character set).

If you need to check (user) input for invalid characters, you can do that pretty easily with regular expressions. For Java, the following regex is ready to use. For other languages, you can base your implementation on the “pure” unescaped regex shown in the first line of the code comment.

/*-
 * ^[A-Za-z0-9 \r\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!"#¤%&'()*+,\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\[~\]|\u20AC]*$
 *
 * Assert position at the beginning of the string «^»
 * Match a single character present in the list below «[A-Za-z0-9 \r\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!"#¤%&'()*+,\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\[~\]|\u20AC]*»
 *    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
 *    A character in the range between "A" and "Z" «A-Z»
 *    A character in the range between "a" and "z" «a-z»
 *    A character in the range between "0" and "9" «0-9»
 *    The character " " « »
 *    A carriage return character «\r»
 *    A line feed character «\n»
 *    One of the characters "@£$¥èéùìòÇØøÅå" «@£$¥èéùìòÇØøÅå»
 *    Unicode character U+0394 «\u0394», Greek capital Delta
 *    The character "_" «_»
 *    Unicode character U+03A6 «\u03A6», Greek capital Phi
 *    Unicode character U+0393 «\u0393», Greek capital Gamma
 *    Unicode character U+039B «\u039B», Greek capital Lambda
 *    Unicode character U+03A9 «\u03A9», Greek capital Omega
 *    Unicode character U+03A0 «\u03A0», Greek capital Pi
 *    Unicode character U+03A8 «\u03A8», Greek capital Psi
 *    Unicode character U+03A3 «\u03A3», Greek capital Sigma
 *    Unicode character U+0398 «\u0398», Greek capital Theta
 *    Unicode character U+039E «\u039E», Greek capital Xi
 *    One of the characters "ÆæßÉ!"#¤%&'()*+," «ÆæßÉ!"#¤%&'()*+,»
 *    A - character «\-»
 *    One of the characters "./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}" «./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}»
 *    A \ character «\\»
 *    A [ character «\[»
 *    The character "~" «~»
 *    A ] character «\]»
 *    The character "|" «|»
 *    Unicode character U+20AC «\u20AC», Euro sign
 * Assert position at the end of the string (or before the line break at the end of the string, if any) «$»
 */
public static final String GSM_CHARACTERS_REGEX = "^[A-Za-z0-9 \\r\\n@£$¥èéùìòÇØøÅå\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039EÆæßÉ!\"#¤%&'()*+,\\-./:;<=>?¡ÄÖÑÜ§¿äöñüà^{}\\\\\\[~\\]|\u20AC]*$";
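
To show how such a constant might be used, here is a minimal sketch. The class `Gsm7BitValidator` and its method `isGsm7Bit` are hypothetical names made up for this example, not part of any library. The character class is restated with every non-ASCII character written as a Unicode escape, so the result does not depend on the encoding of the source file, and it includes the currency sign ¤ (U+00A4), which sits at position 0x24 of the GSM default alphabet. The pattern is compiled once and reused.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; the names are assumptions, not from the original post. */
final class Gsm7BitValidator {

    // GSM 03.38 character class, with all non-ASCII characters written as
    // Unicode escapes so the source file encoding cannot corrupt them.
    static final String GSM_CHARACTERS_REGEX =
            "^[A-Za-z0-9 \\r\\n"
            + "@\u00A3$\u00A5\u00E8\u00E9\u00F9\u00EC\u00F2\u00C7\u00D8\u00F8\u00C5\u00E5"
            + "\u0394_\u03A6\u0393\u039B\u03A9\u03A0\u03A8\u03A3\u0398\u039E"
            + "\u00C6\u00E6\u00DF\u00C9!\"#\u00A4%&'()*+,\\-./:;<=>?"
            + "\u00A1\u00C4\u00D6\u00D1\u00DC\u00A7\u00BF\u00E4\u00F6\u00F1\u00FC\u00E0"
            + "^{}\\\\\\[~\\]|\u20AC]*$";

    // Compile once; Pattern instances are immutable and safe to share.
    private static final Pattern GSM_PATTERN = Pattern.compile(GSM_CHARACTERS_REGEX);

    /** Returns true if every character of the text is in the character class above. */
    static boolean isGsm7Bit(String text) {
        return GSM_PATTERN.matcher(text).matches();
    }
}
```

A call such as `Gsm7BitValidator.isGsm7Bit(userInput)` can then decide whether a message has to be rejected or sent in a different encoding. Note that the empty string passes, because the character class is followed by `*`.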

Gee, who allowed the Greeks to smuggle part of their alphabet into GSM 03.38? Since those characters don’t fit into Latin-1 (ISO-8859-1), they are written as Unicode escapes (\uXXXX) in the regex, and so is the Euro sign. More on that in this excellent regex tutorial: http://www.regular-expressions.info/unicode.html. Oh yes, and I do recommend using RegexBuddy; it really is my regex life-saver.
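
As a small illustration of the escaping point, the sketch below (with a hypothetical class name, `UnicodeEscapeDemo`) shows two ways a `\uXXXX` escape can reach the regex engine in Java, and checks which characters fit into Latin-1. In the first constant the Java compiler turns the escape into the actual character before the regex is ever compiled; in the second, the backslash is doubled, so the regex engine itself interprets the escape. As far as I can tell, both behave the same for this use case.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical demo class; names are assumptions made for this example. */
final class UnicodeEscapeDemo {

    // Compiler-level escape: the string literal already contains Delta and the Euro sign.
    static final String COMPILER_ESCAPE = "^[\u0394\u20AC]*$";

    // Regex-level escape: the string holds a literal backslash plus "u0394",
    // and java.util.regex resolves it when the pattern is compiled.
    static final String REGEX_ESCAPE = "^[\\u0394\\u20AC]*$";

    /** Returns true if the whole input matches the given regex. */
    static boolean accepts(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    /** Returns true if the character can be encoded in ISO-8859-1 (Latin-1). */
    static boolean fitsLatin1(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }
}
```

When porting the “pure” regex to another language, it is worth checking whether that engine understands `\uXXXX` at all; if not, the characters have to be written literally or with that engine's own escape syntax.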
