Regular Expressions

Here is a topic that has flustered a lot of developers. Regular expressions are a concept that can be hard to get a real handle on. PHP has several functions for working with regular expressions. The one I focus on most is:
preg_match()

This is a very useful tool, and the PHP manual page for ereg() states that preg_match() is often a faster alternative to ereg(). (The ereg functions were later deprecated in PHP 5.3 and removed in PHP 7, so preg_match() is the one to learn.) I am not going to get into the details of speed and response times for the two functions, as there will always be someone with a different opinion or a case that shows how their way is better, and that is fine. What most people have a hard time with is getting the actual match to do what is needed. There are times when it is just easier to do a Google search, grab some code that someone else has already written and plug it in. But the real power is in knowing what you are doing first, so that you can build your own.
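To get a feel for the function before diving into patterns, here is a minimal sketch of calling preg_match(). The pattern and the sample strings are made up for illustration; the point is the return value: 1 for a match, 0 for no match, and false if something went wrong.

```php
<?php
// preg_match() returns 1 on a match, 0 on no match, and false on error.
$pattern = '/[0-9]+/'; // illustrative pattern: one or more digits anywhere in the string

$found   = preg_match($pattern, 'order 66');  // 1
$missing = preg_match($pattern, 'no digits'); // 0

// The optional third argument receives the text that matched.
preg_match($pattern, 'order 66', $matches);

var_dump($found, $missing, $matches[0]);
```

Because 0 and false are both "falsy", comparing the result with 1 is the safest way to turn it into a yes/no answer.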

For this example, we can take a look at CakePHP’s own little validation object. When you set up a model and add some validation to it, it calls this object. Based on the data that is going into the tables, it will call one of the object’s functions. These functions work by checking the input against a specific character list/set that the text should consist of. If the entry does not match, it is not validated. CakePHP does this by using the preg_match() function.

If you are new to regular expressions, then seeing something like this:

define('VALID_EMAIL', "/^[a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[a-z]{2,4}|museum|travel)$/i");

may be a little scary. But never fear, this is not as bad as it seems. It just looks real scary. And besides, even Cake made it a little better.

So let’s look at this function:

function email($check, $deep = false, $regex = null) {
. . . 
	if (is_null($_this->regex)) {
		$_this->regex = '/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@' . $_this->__pattern['hostname'] . '$/i';
	}
	$return = $_this->_check();
. . . 
}

function _check() {
. . . 
	if (preg_match($_this->regex, $_this->check)) {
		$_this->error[] = false;
		return true;
	} else {
		$_this->error[] = true;
		return false;
	}
}

If you would like to know more about the function, please see the CakePHP manual; all this article is going to do is point out the regex part of it.

First off, the function called is “email”. In this function, the regex pattern is set to the following:
'/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@' . $_this->__pattern['hostname'] . '$/i'

It then calls the _check() function, which applies this regex pattern to the email address that was entered to see if there is a match. So, to jump ahead, if someone were to put in an email address of:
– flavio@nothing.com
it would pass that function. (NOTE: this does not mean it is a valid email address that can receive emails, it just means that the characters in the email address are in the valid format of (address_name)@(hostname).(extension).)

But this also means that someone could put in the email address:
– flavio@nothing.show
And that would pass as well, while the following email addresses:
– flavio
– flavio@nothing
– flavioATnothingDOTcom
would fail.
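These claims can be checked with a short script. This is only a sketch: it uses the VALID_EMAIL pattern shown earlier (rewritten as a single-quoted string) rather than CakePHP's own email() function, and the passes() helper and $addresses array are names made up for this example.

```php
<?php
// Same pattern as the VALID_EMAIL constant above, written with single quotes.
$validEmail = '/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[a-z]{2,4}|museum|travel)$/i';

// Hypothetical helper: true when the address matches the pattern.
function passes($pattern, $address) {
    return preg_match($pattern, $address) === 1;
}

$addresses = array(
    'flavio@nothing.com',
    'flavio@nothing.show',
    'flavio',
    'flavio@nothing',
    'flavioATnothingDOTcom',
);

foreach ($addresses as $address) {
    echo $address, ': ', passes($validEmail, $address) ? 'pass' : 'fail', "\n";
}
```

The first two addresses should print "pass" and the last three "fail", since they are missing either the @ or the dotted extension.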

But now that we know that, how did we get there?

Let’s break down the match.

First, since we are looking for a pattern, we need to add the delimiters:
/ /
around the pattern. This follows Perl syntax, which the preg functions borrow. Next come the bookends: the caret ( ^ ) and the dollar sign ( $ ). The caret anchors the match to the beginning of the string, and the dollar sign anchors it to the end. In this example, that means the pattern has to cover the whole email address entered, from the first character to the last, rather than just some part of it.

'/^[ ]$/'
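A small sketch of what the bookends change. The [a-z]+ pattern and the sample strings here are stand-ins chosen for illustration, not part of the CakePHP code.

```php
<?php
// Without anchors, the pattern may match anywhere inside the string.
$loose = preg_match('/[a-z]+/', '123abc456');    // 1: "abc" is found in the middle

// With ^ and $, the whole string has to fit the pattern.
$strict = preg_match('/^[a-z]+$/', '123abc456'); // 0: the digits do not fit
$exact  = preg_match('/^[a-z]+$/', 'abc');       // 1

var_dump($loose, $strict, $exact);
```

For validation this is usually what you want: an address with junk before or after it should not sneak through just because part of it looks right.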

So we are off to a good start. Since the match does not need to be case sensitive (we do not care whether the address contains uppercase letters), we add an “i” modifier after the closing delimiter.

'/^[ ]$/i'
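As a quick illustration of the modifier, again with a stand-in [a-z]+ pattern rather than the real one:

```php
<?php
$sensitive   = preg_match('/^[a-z]+$/', 'Flavio');  // 0: "F" is not in a-z
$insensitive = preg_match('/^[a-z]+$/i', 'Flavio'); // 1: with i, a-z also matches A-Z

var_dump($sensitive, $insensitive);
```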

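Now we are ready to start looking at the beginning of the string. We are going to put the allowed characters inside square brackets to create a character class, which matches any one character from the list. We want to accept any valid characters for the account name of an email address: letters, numbers and some special characters. Two of them need to be escaped with a “\”: the single quote, because the PHP string itself is wrapped in single quotes, and the forward slash, because it is also the pattern delimiter. The “+” after the closing bracket means “one or more” of those characters; without it the pattern would only accept a single character.

'/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+$/i'

This will now look for the address name, sort of. As long as there are no periods in the account name, it will work, but there are addresses out there that have a “.” in them: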
– flavio.elguappo@nothing.com
would not pass the validation as of yet (given that we would have also added the domain by the time it is checked).
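Here is a rough check of the account-name class on its own, with a + after the class so that it accepts one or more characters. The sample names are invented for the example.

```php
<?php
// Account-name class followed by + (one or more characters), anchored at both ends.
$accountName = '/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+$/i';

$plain  = preg_match($accountName, 'flavio');          // 1
$dotted = preg_match($accountName, 'flavio.elguappo'); // 0: "." is not in the class
$spaced = preg_match($accountName, 'fla vio');         // 0: neither is a space
$odd    = preg_match($accountName, "o'brien+news");    // 1: ' and + are in the class

var_dump($plain, $dotted, $spaced, $odd);
```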

So we need to add that match set to the expression. CakePHP does this with a non-capturing group. Parentheses normally create a capturing group, which stores the text it matched so it can be used later as a “backreference”. You can turn that off with “?:”. The question mark-colon combo right after the opening parenthesis says that the group is only there for grouping and nothing should be captured. Now, to get the addresses that have a period in the account name we add this:
(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)

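As you can see, this is almost the same pattern as before, with an escaped period in front of it. We want to check the same things again after any appearance of a period. Because an address may have no periods in the account name at all, or several, the group is followed by a “*”, which means “zero or more times”. So now it looks like:

'/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*$/i'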
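The sketch below shows two things under simplified assumptions: what “?:” changes about a group, and how the account-name pattern behaves once the dotted group is followed by a * so it may repeat zero or more times. The short [a-z]+ patterns and the sample names are illustrative only.

```php
<?php
// Capturing parentheses store what they matched in $captured[1];
// (?: ) groups without storing anything.
preg_match('/^[a-z]+(\.[a-z]+)$/', 'flavio.elguappo', $captured);
preg_match('/^[a-z]+(?:\.[a-z]+)$/', 'flavio.elguappo', $grouped);
echo count($captured), ' vs ', count($grouped), "\n"; // 2 vs 1

// Account-name pattern with the dotted group repeated zero or more times.
$accountName = '/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*$/i';

$results = array();
foreach (array('flavio', 'flavio.elguappo', 'flavio.', '.flavio', 'fla..vio') as $name) {
    $results[$name] = preg_match($accountName, $name);
}
print_r($results);
```

Only the first two names match. A leading, trailing or doubled period fails, because every period has to be followed by at least one allowed character.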

Last thing here: we need to account for the @ symbol in the address and add the domain name. CakePHP already keeps a pattern for the domain in $_this->__pattern['hostname'], so it simply appends that:

'/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@' . $_this->__pattern['hostname'] . '$/i'

And the final PHP code would be as follows:

$checked = preg_match('/^[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+)*@' . $_this->__pattern['hostname'] . '$/i', $email_address_to_check);

return $checked;
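Outside of CakePHP there is no $_this->__pattern['hostname'] to lean on, so here is a self-contained sketch of the same idea. The $hostname value is an assumption: it is the domain part of the VALID_EMAIL constant shown earlier, not necessarily what CakePHP uses internally, and is_valid_email_format() is a hypothetical helper name.

```php
<?php
// ASSUMPTION: stand-in for $_this->__pattern['hostname']; this is the domain
// part of the VALID_EMAIL constant, not necessarily what CakePHP ships.
$hostname = '(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[a-z]{2,4}|museum|travel)';

// Hypothetical helper, not part of CakePHP.
function is_valid_email_format($address, $hostname) {
    // Characters allowed in each dot-separated piece of the account name.
    $account = '[a-z0-9!#$%&\'*+\/=?^_`{|}~-]+';
    $regex = '/^' . $account . '(?:\.' . $account . ')*@' . $hostname . '$/i';
    // preg_match() returns 1, 0 or false; compare with 1 to get a real boolean.
    return preg_match($regex, $address) === 1;
}

var_dump(is_valid_email_format('flavio.elguappo@nothing.com', $hostname));
var_dump(is_valid_email_format('flavio@nothing', $hostname));
```

Keeping the account-name class in one variable avoids typing the long character list twice, which is where typos tend to creep in.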

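Now I am sure I glossed over some regex rules and explanations. I will never claim to be an expert on regex, as I am still learning as much as I can about it. This is also not the only way to check an email address. If you do not use CakePHP and its nifty built-in validation, then you can use this one:

"/^[^0-9][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[@][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[.][A-Za-z]{2,4}$/"

This one is similar: it checks the account name for valid characters, sees if it has any periods in the name, checks for the @ symbol, and then checks the domain name and the domain extension, allowing for 2-4 characters in the extension. The ranges are written as “A-Za-z” rather than “A-z”, because “A-z” also matches the punctuation characters that sit between “Z” and “a” in ASCII. Keep in mind that “[^0-9]” accepts any first character that is not a digit, that the account name needs at least two characters, and that hyphens are not allowed anywhere.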
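A range written as A-z is a common shortcut in patterns like this one, and it is worth seeing why it is risky. The sketch below first shows the difference between A-z and A-Za-z, then tries the standalone pattern (with explicit A-Za-z ranges) on a few invented addresses.

```php
<?php
// A-z spans ASCII 65 to 122, which includes [ \ ] ^ _ ` between "Z" and "a".
$loose = preg_match('/^[A-z]$/', '^');    // 1
$tight = preg_match('/^[A-Za-z]$/', '^'); // 0

// The standalone pattern with explicit A-Za-z ranges.
$pattern = '/^[^0-9][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[@][A-Za-z0-9_]+([.][A-Za-z0-9_]+)*[.][A-Za-z]{2,4}$/';

$ok       = preg_match($pattern, 'flavio.elguappo@nothing.com'); // 1
$digit    = preg_match($pattern, '1flavio@nothing.com');         // 0: starts with a digit
$hyphen   = preg_match($pattern, 'flavio@my-host.com');          // 0: "-" is not allowed
$noDomain = preg_match($pattern, 'flavio@nothing');              // 0: no extension

var_dump($loose, $tight, $ok, $digit, $hyphen, $noDomain);
```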

And if you want more good info on regex, here are a couple of links I found useful:
http://www.webcheatsheet.com/php/regular_expressions.php
http://www.regular-expressions.info/tutorial.html
http://us3.php.net/manual/en/function.preg-match.php
(Make sure you read the comments as they have some good info)