Sunday, May 11, 2008

Introduction to Regular Expressions

A regular expression is a powerful tool for handling strings. For example, it is often used to validate email addresses, find all URLs in an HTML file, and so on. Many programming languages support regular expressions, such as Java, Python and PHP. Some languages, like C and C++ (before C++11 added std::regex), have no built-in support, but you can find libraries (such as Boost) that handle regular expressions instead. In this post I don't want to explain how to use regular expressions in each programming language, because the core syntax is largely the same across them. So let's look at that syntax. For clarity, I write '/.../' for a regular expression and '...' for a normal string. The symbol '~' means 'matches'.

Matching strings is what regular expressions are designed for. For example, '/ea/' matches every string that contains the characters 'ea', such as 'please', 'reason' and 'easy'. The dot '.' matches any single character (in most implementations, any character except a newline). Characters placed inside '[]' mean 'any one of these'. See below:
'/[bcm]at/' ~ 'bat', 'cat', 'mat'. A '[]' matches exactly one character, so in 'bmat' the pattern matches only the 'mat' part, not the whole of 'bmat'.
'/[a-z]/' matches every string that contains a lowercase letter. Here the minus symbol '-' means 'to'; it describes a range.
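
The examples above can be tried in Python, one of the languages mentioned earlier. This is only a minimal sketch using the standard `re` module. The helper name `matches` is my own invention, and I use `re.search` because it looks for the pattern anywhere in the string, which is what '~' means in this post.

```python
import re

def matches(pattern, text):
    """Return True if pattern is found anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# '/ea/' matches any string that contains 'ea'
words = ['please', 'reason', 'easy', 'plan']
print([w for w in words if matches(r'ea', w)])

# '.' matches any single character (except a newline, by default)
print(matches(r'c.t', 'cut'))

# '[bcm]' matches exactly one of 'b', 'c' or 'm'
print([w for w in ['bat', 'cat', 'mat', 'rat'] if matches(r'[bcm]at', w)])

# In 'bmat' the part that matches is 'mat', not the whole string
print(re.search(r'[bcm]at', 'bmat').group())

# '[a-z]' needs at least one lowercase letter somewhere in the string
print(matches(r'[a-z]', 'ABC'), matches(r'[a-z]', 'ABc'))
```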

Now let me give some more useful examples:

How do we validate an IP address? An IP (IPv4) address is written as xxx.xxx.xxx.xxx, where each 'xxx' is one to three digits. We can use '/([0-9]{1,3}\.){3}[0-9]{1,3}/' to match it. Some special characters here need explaining. First, '[0-9]{1,3}' means a digit repeated one to three times, i.e. the subexpression matches '1', '12', '123', etc. Second, '\.' is an escaped dot: the backslash makes it match a literal '.' instead of any character. Third, the parentheses form a group, because we want the subexpression '[0-9]{1,3}\.' to be applied three times; a final '[0-9]{1,3}' then matches the last number. Note that this checks only the shape of the address: it does not check that each number is at most 255, so '999.999.999.999' also matches. As you can see, braces specify the repeat count. Moreover, '{1,}' means 'repeat one or more times' and '{0,}' means 'repeat zero or more times'. So you can use '/[1-9][0-9]{0,}/' to match any positive integer.
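
Here is a rough Python sketch of the two ideas in this paragraph: repeat counts in braces and grouping with parentheses. It assumes that the dots between the numbers are matched explicitly with '\.', and it uses `fullmatch` so that the whole string has to fit the pattern. The names `IP_SHAPE`, `POSITIVE_INT` and `looks_like_ip` are made up for this example.

```python
import re

# Three groups of "1 to 3 digits followed by a dot", then a final 1 to 3 digits.
IP_SHAPE = re.compile(r'([0-9]{1,3}\.){3}[0-9]{1,3}')

# A first digit 1-9, then zero or more further digits.
POSITIVE_INT = re.compile(r'[1-9][0-9]{0,}')

def looks_like_ip(text):
    """Check the shape only; the 0-255 range is not verified (hypothetical helper)."""
    return IP_SHAPE.fullmatch(text) is not None

print(looks_like_ip('192.168.0.1'))   # True
print(looks_like_ip('192.168.0'))     # False: only three numbers
print(looks_like_ip('1234.1.1.1'))    # False: four digits in one number
print(looks_like_ip('999.1.1.1'))     # True: the value range is not checked

print(POSITIVE_INT.fullmatch('42') is not None)    # True
print(POSITIVE_INT.fullmatch('007') is not None)   # False: leading zero
```

Checking that each number is between 0 and 255 is easier to do in ordinary code after the match than inside the regular expression itself.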

How do we validate an email address? An email address consists of an account name, the symbol '@' and a domain name. This one is a little harder. Let's see the regular expression first: '/^[a-zA-Z0-9\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-.]+$/'. It contains more special characters. The symbol '^' outside [] means the match must start at the head of the string. For example, '/^ea/' matches 'easy' but not 'reason'. By the way, '^' inside [] has another meaning, which I will explain in a moment. The dot '.' inside [] doesn't stand for any character; there it is just a literal dot '.'. The minus symbol is different: inside [] it normally describes a range, so to match a literal '-' it is escaped as '\-' (or placed first or last in the []). Outside [], '\.' matches a literal dot. The plus symbol '+' means 'repeat one or more times'; it is equivalent to {1,}. A similar symbol, '*', means 'repeat zero or more times'. The last symbol in the regular expression, '$', is the opposite of '^': the match must end at the tail of the string. For example, '/ing$/' matches 'doing' and 'seeing', but not 'surprisingly'.
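
The anchors and the '+' and '*' symbols can be checked with a short Python sketch. It reuses the email pattern from this paragraph as written; the names `EMAIL` and `is_email` and the sample addresses are my own. Real email addresses allow more characters than this pattern does, so treat it as an illustration rather than a complete validator.

```python
import re

# The email pattern from the text; inside [] the '\-' and '.' are literal characters.
EMAIL = re.compile(r'^[a-zA-Z0-9\-.]+@[a-zA-Z0-9\-]+\.[a-zA-Z0-9\-.]+$')

def is_email(text):
    """Return True if text fits the simplified pattern (hypothetical helper)."""
    return EMAIL.search(text) is not None

print(is_email('john.doe@example.com'))   # True
print(is_email('john.doe@example'))       # False: no dot after the domain name
print(is_email('john doe@example.com'))   # False: a space is not allowed

# '^' anchors the match to the head of the string, '$' to the tail
print(re.search(r'^ea', 'easy') is not None)            # True
print(re.search(r'^ea', 'reason') is not None)          # False
print(re.search(r'ing$', 'doing') is not None)          # True
print(re.search(r'ing$', 'surprisingly') is not None)   # False

# '+' needs at least one repetition, '*' also accepts none
print(re.fullmatch(r'ab+', 'a') is not None)   # False
print(re.fullmatch(r'ab*', 'a') is not None)   # True
```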

Just now I said that '^' inside [] has another meaning. What is it? It means 'not': the [] matches any one character that is not listed. For example, '/[^0-9]/' matches any non-digit character.
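
A small Python sketch of the negated '[]'. The sample strings are made up, and the last line shows one possible use (stripping everything except digits), which is my own illustration.

```python
import re

# '[^0-9]' matches one character that is not a digit
print(re.search(r'[^0-9]', '12a45').group())   # a
print(re.search(r'[^0-9]', '12345'))           # None: every character is a digit

# One possible use: remove every non-digit character from a string
print(re.sub(r'[^0-9]', '', '+1 (555) 010-9999'))   # 15550109999
```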

OK, this is just an introduction to regular expressions. Regular expressions are a very powerful tool, and they can become very complex. You can even use them to help parse program code: many text editors now provide syntax highlighting, which can be implemented with regular expressions. They are a necessary skill for a programmer.
