Regular Expressions

Regular expressions (commonly abbreviated to regex) are widely used throughout computing. A regex pattern describes a set of strings; any string that belongs to that set is said to match the pattern.

You may be familiar with using a wildcard * to mean "anything can be here". A regex is a much more powerful extension of that idea: rather than "anything", we can specify which values are allowed and where in the string they may appear.

In part one of the rewrite tutorial, we used an article script as an example. Assume we have configured our script to output URLs in the format /articles/24-article-title-here.html but we need that rewritten to /articles.php?id=24. Before we can do anything, we need a regex pattern that matches this string:

articles/24-article-title-here.html

That solves the problem, but only when we view article number 24. It is still a regex pattern: most characters are treated as literals, or in other words they match only themselves. An "a" in a regex will match an "a", a "b" in a regex will match a "b", and so on. On its own that would be fairly useless, and this is where our special characters come in. These are known as metacharacters and have a special meaning, as described below.
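
To make the limitation concrete, here is a minimal sketch using Python's re module as a stand-in for the regex engine Apache uses (an assumption; for a pattern this simple the two should behave the same). The second request path is made up for illustration.

```python
import re

# The static pattern from the text; it only describes article 24.
static_pattern = "articles/24-article-title-here.html"

# Hypothetical request paths used only for illustration.
hit = re.search(static_pattern, "articles/24-article-title-here.html")
miss = re.search(static_pattern, "articles/25-another-title.html")

print(hit is not None)   # True: article 24 matches
print(miss is not None)  # False: any other article does not
```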

Syntax

Values:

Anchors:

"pie" alone will match "I like pie", "pie is good" or "Chicken pie no thanks".

"^pie" will match "pie is good" and any other string starting with "pie".

"pie$" will match "I like pie" and any other string ending with "pie".

"^pie$" will match only the string "pie".

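The four anchor cases above can be checked with a short sketch. It uses Python's re module rather than Apache itself, which is an assumption made for convenience; the sample strings are the ones from the text, and the helper name matching is hypothetical.

```python
import re

# Sample strings taken from the examples above.
samples = ["I like pie", "pie is good", "Chicken pie no thanks", "pie"]

def matching(pattern):
    """Return the samples in which the pattern is found anywhere."""
    return [s for s in samples if re.search(pattern, s)]

unanchored = matching(r"pie")    # "pie" anywhere in the string
starts_with = matching(r"^pie")  # "pie" at the start only
ends_with = matching(r"pie$")    # "pie" at the end only
exact = matching(r"^pie$")       # the whole string must be "pie"

print(unanchored)
print(starts_with)
print(ends_with)
print(exact)
```

Without anchors every sample matches; adding ^, $ or both narrows the result down to strings that start with, end with, or consist only of "pie".
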
Quantifiers:

Up to now, we have only matched one character at a time.

The rest:

The pipe symbol | means "OR". "gray|grey" matches "gray" or "grey".

Parentheses (round brackets) can be used to group expressions. The above example could be simplified to "gr(e|a)y".
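
As a rough check that the two spellings of the pattern are equivalent, the sketch below runs both against a few words using Python's re module (an assumption; Apache is not involved). "groy" and "gry" are made-up non-matching words.

```python
import re

# "gray" and "grey" come from the text; the other two are made up.
words = ["gray", "grey", "groy", "gry"]

# fullmatch requires the whole word to match, like wrapping the pattern in ^...$
alternation = [w for w in words if re.fullmatch(r"gray|grey", w)]
grouped = [w for w in words if re.fullmatch(r"gr(e|a)y", w)]

print(alternation)
print(grouped)
```

Both lists come out the same, which is the point: the group limits the "OR" to the single letter that differs.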

Escape Special Meaning

Since the characters .+*?^$|[]() and the backslash itself all have a special meaning in a regex pattern, if you wish to match one of them literally, a period for example, you must escape it with a backslash. ".html" matches any character followed by "html". "\.html" matches only ".html".
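
The difference between the escaped and unescaped period is easy to miss, so here is a small sketch of it, again assuming Python's re module as a stand-in for the engine Apache uses. The string "indexxhtml" is invented purely to show the false match.

```python
import re

unescaped = re.compile(r".html")  # "." means "any one character"
escaped = re.compile(r"\.html")   # "\." means a literal period

# "indexxhtml" is a made-up string with no period in it.
print(bool(unescaped.search("index.html")))  # True
print(bool(unescaped.search("indexxhtml")))  # True: "." matched the "x"
print(bool(escaped.search("index.html")))    # True
print(bool(escaped.search("indexxhtml")))    # False: no literal period
```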

Example

Going back to our example, we can now create a much more useful pattern that matches any possible ID number and any possible title in the URL.

^articles/([0-9]+)-(.*)\.html$

That may look a little harder than our original static pattern but if we go through it bit by bit, it makes perfect sense.

First, the ^ anchor followed by the literal articles/ makes sure the request starts with articles/. Then we match the ID number by using the range [0-9] and repeating it one or more times with +, which allows us to match any numeric value there. Note this is in parentheses so that we can use it as a backreference (explained in the rewrite tutorial, but basically we want to reuse the value matched there). We then match a literal dash, match any sequence of characters in the title with (.*), and end with an escaped \.html extension anchored by $.
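
To see the captured groups in action, the following sketch applies the pattern outside of Apache, using Python's re module (an assumption; in mod_rewrite the first group would be referenced as a backreference instead). The function name rewrite and the non-matching path are hypothetical.

```python
import re

# The pattern from the text: group 1 is the ID, group 2 is the title.
ARTICLE_URL = re.compile(r"^articles/([0-9]+)-(.*)\.html$")

def rewrite(path):
    """Return the rewritten URL, or None if the path does not match."""
    m = ARTICLE_URL.match(path)
    if m is None:
        return None
    article_id = m.group(1)  # the value a backreference would carry
    return f"/articles.php?id={article_id}"

match = ARTICLE_URL.match("articles/24-article-title-here.html")
print(match.group(1))  # 24
print(match.group(2))  # article-title-here
print(rewrite("articles/24-article-title-here.html"))  # /articles.php?id=24
print(rewrite("articles/abc-title.html"))  # None: the ID must be digits
```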

Easy?