
PHP Tutorial :: Regex (I)

PHP Example #117

Introduction

Regular expressions open up a complex and powerful world of pattern matching that lets us check whether input values have the form we want. They make it possible to determine whether a user has entered a valid ZIP code or phone number in a form, or to find all the <a> tags in an HTML document. If a website is based on data managed in text files, such as classifications, news or articles, regular expressions help to make sense of that data. This tutorial covers the fundamental and most useful aspects of regular expressions, which are enough to solve most of the text processing problems that an average PHP website runs into.

A regular expression is a string that defines a pattern to be matched against other strings. For example, the regex \d{5}(-\d{4})? matches an American ZIP code, which is composed of five mandatory digits plus four optional digits. This particular regex works like this:

\d matches any digit from 0 to 9

{5} requires exactly 5 repetitions of the preceding element

- is a literal character, it just matches itself

\d matches any digit from 0 to 9

{4} requires exactly 4 repetitions of the preceding element

( )? makes everything enclosed in the parentheses optional

So this expression means that the string being processed must contain five digits, optionally followed by a hyphen and four more digits. Here is another regex to study: </?[bBiI]>

< is a literal character, it just matches itself

/ is a literal character, it just matches itself

? makes the element right before it (here, the /) optional

[bBiI] matches any single one of the characters enclosed in the brackets

> is a literal character, it just matches itself

This expression matches any of these HTML tags: <b>, <B>, </b>, </B>, <i>, <I>, </i> and </I>.
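The two patterns above can be tried out with a short sketch like the following. The helper names containsZip and findBoldItalicTags are hypothetical, chosen only for this illustration. Note that PHP's preg functions expect the pattern to be wrapped in delimiters, which the bare patterns in the text do not show; here / is used for the first pattern and # for the second, so that the / inside the tag pattern needs no escaping.

```php
<?php
// Hypothetical helper: true when $text contains something shaped like a US ZIP code.
function containsZip(string $text): bool
{
    return preg_match('/\d{5}(-\d{4})?/', $text) === 1;
}

// Hypothetical helper: collects every bold or italic tag found in $html.
function findBoldItalicTags(string $html): array
{
    // $matches[0] holds the full text of every match.
    preg_match_all('#</?[bBiI]>#', $html, $matches);
    return $matches[0];
}

var_dump(containsZip('Beverly Hills, CA 90210'));   // bool(true)
var_dump(containsZip('New York, NY 10001-2345'));   // bool(true)
var_dump(containsZip('No code here'));              // bool(false)

// Finds <b>, </b>, <I> and </I>, but ignores <u> and </u>.
print_r(findBoldItalicTags('<b>bold</b> and <I>italic</I> <u>no</u>'));
```

Since the pattern is not anchored, containsZip only checks that a ZIP-like sequence appears somewhere in the text, not that the whole text is a ZIP code.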

PHP Example #118

Characters and metacharacters

The characters that only match themselves in a regular expression are called literals, while those that have a special meaning are called metacharacters. A pattern that only includes literal characters matches any string that contains that exact sequence of literals. For example, the pattern href= matches any string that contains href= in it, no matter where in the string the pattern is found, and also a string that is simply equal to the pattern.

The metacharacter . matches any character, although this is not entirely true: by default it does not match a line break character. Activating the pattern modifier s makes the metacharacter . match a line break as well. Therefore, the pattern d.g matches occurrences like "dog", "adagio", "digdug", "*d*g*" and so on, and also "d.g", since the metacharacter . matches its literal equivalent. Without a quantifier, the metacharacter . matches exactly one character, so d.g does not match words like "ridge" (no character between d and g) or "doug" (more than one character between d and g).
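As a minimal sketch of this behaviour, the snippet below tests d.g against some of the words mentioned and then shows the effect of the s modifier. The helper name hasDotMatch is hypothetical, and the / delimiters are the ones PHP's preg_match expects around a pattern.

```php
<?php
// Hypothetical helper: true when $subject contains a match for $pattern.
function hasDotMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

var_dump(hasDotMatch('/d.g/', 'adagio'));   // bool(true): "dag"
var_dump(hasDotMatch('/d.g/', 'd.g'));      // bool(true): . also matches a literal dot
var_dump(hasDotMatch('/d.g/', 'ridge'));    // bool(false): nothing between d and g
var_dump(hasDotMatch('/d.g/', 'doug'));     // bool(false): two characters between d and g

// A line break between d and g only matches when the s modifier is active.
var_dump(hasDotMatch('/d.g/', "d\ng"));     // bool(false)
var_dump(hasDotMatch('/d.g/s', "d\ng"));    // bool(true)
```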

The metacharacter | is used for alternative patterns, that is, for building patterns that match more than one set of characters. For example, dog|cat matches strings that contain "dog" or "cat", such as "dog", "cathode", "redogame" or "hotdog". Each alternative normally extends from the start of the pattern to the |, or from the | to the end of the pattern; however, the scope of the alternatives can be restricted by placing the options within parentheses. For example, s(cr|in)ew means something like "match s, then cr or in, and then ew". Therefore, this regular expression matches strings such as "screw" or "sinew", but not strings such as "screen" or "deminews". Alternatives can be used with more than two options, by building an expression like s(cr|in|tr|ch)ew, which matches "screw", "sinew", "strew" or "eschew".
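The following sketch checks these alternatives against the example words. The helper name hasAlternative is hypothetical, and the / delimiters are added because PHP's preg_match requires them.

```php
<?php
// Hypothetical helper: true when $subject contains a match for $pattern.
function hasAlternative(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Without parentheses, each alternative is a complete pattern on its own.
var_dump(hasAlternative('/dog|cat/', 'cathode'));   // bool(true)
var_dump(hasAlternative('/dog|cat/', 'hotdog'));    // bool(true)

// With parentheses, only the part inside them varies.
var_dump(hasAlternative('/s(cr|in)ew/', 'sinew'));      // bool(true)
var_dump(hasAlternative('/s(cr|in)ew/', 'screen'));     // bool(false)
var_dump(hasAlternative('/s(cr|in)ew/', 'deminews'));   // bool(false)

// More than two options work the same way.
var_dump(hasAlternative('/s(cr|in|tr|ch)ew/', 'eschew'));   // bool(true)
```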

Using parentheses to group characters for alternatives is called grouping, and it applies to quantifiers as well. Parentheses are also used to capture the text that they enclose for later use. The characters that match the part of the pattern inside a set of parentheses are stored in a special variable, from which they can be retrieved at a later stage.
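As an illustration of capturing, the sketch below assumes the preg_match function is used: its optional third argument receives an array in which index 0 is the whole match and index 1 is the text captured by the first set of parentheses. The helper name captureMiddle is hypothetical.

```php
<?php
// Hypothetical helper: returns the text captured by the group in s(cr|in)ew,
// or null when the pattern does not match at all.
function captureMiddle(string $subject): ?string
{
    if (preg_match('/s(cr|in)ew/', $subject, $matches) === 1) {
        // $matches[0] is the full match, $matches[1] the first group.
        return $matches[1];
    }
    return null;
}

var_dump(captureMiddle('a loose screw'));   // string(2) "cr"
var_dump(captureMiddle('pure sinew'));      // string(2) "in"
var_dump(captureMiddle('a big screen'));    // NULL
```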

PHP Example #119

Quantifiers

A quantifier is a metacharacter placed after an element of the pattern (a character, a class or a group) to indicate how many times that element has to match. The different quantifiers are:

* (Zero or more times)

+ (One or more times)

? (Optional, zero or one time)

{x} (Exactly x times)

{x,} (At least x times)

{x,y} (At least x times, but not more than y times)

Here are some examples of regular expressions that use quantifiers:

| Expression | Meaning | Matches | Doesn't match |
| --- | --- | --- | --- |
| ba+ | b, then at least one a | ba, baa, baaa, rumba, babar | b, abs, taaa-daaa, celeste |
| ba+na*s | b, then at least one a, then n, then zero or more a, then s | turbans, baanas, rumbanas | banana, bananas |
| ba(na){2} | ba, then two of na | banana, bananas, semibanana, bananarama | canaba, banarama |
| ba{2,}ba{3,} | b, then at least two a, then b, then at least three a | baabaaa, baaaaabaaaaa, rumbaabaaas | baabaa, babaaar, banana |
| (baa-){2,4}baa | At least two but not more than four of baa-, then baa | baa-baa-baa, baa-baa-baa-baa-baa, oomp-pa-pa-baa-baa-baa-oomp-pa-pa | baa-baa, baa-baad-news |
| dogs? and cats?( and chickens?)? | dog, then one optional s, then " and cat", then one optional s, then one optional " and chicken" or " and chickens" | dog and cat and chicken, dog and cat and chickens, hotdogs and cats, dogs and cat and chickens, dog and cats and chicken, dog and cat and chickensoup, dogs and cats or chickens, dog and cat and chickenlegs | doggies and cats, dogss and catss |

Because these patterns are not anchored, they match any string that contains a matching part. That is why "dogs and cats or chickens" and "dog and cat and chickenlegs" match the last expression: the first contains "dogs and cats" and the second contains "dog and cat and chicken".
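Some rows of the table can be checked with a sketch like this one, which filters a list of words with preg_grep. The helper name wordsMatching is hypothetical, and the / delimiters are added because PHP's preg functions require them.

```php
<?php
// Hypothetical helper: returns the words that contain a match for $pattern.
function wordsMatching(string $pattern, array $words): array
{
    // preg_grep() keeps the original array keys, so the result is reindexed.
    return array_values(preg_grep($pattern, $words));
}

// ba+ needs at least one a after the b.
print_r(wordsMatching('/ba+/', ['ba', 'rumba', 'b', 'abs']));
// keeps: ba, rumba

// ba+na*s allows zero or more a between the n and the s.
print_r(wordsMatching('/ba+na*s/', ['turbans', 'baanas', 'banana', 'bananas']));
// keeps: turbans, baanas

// ba{2,}ba{3,} needs two or more a, then b, then three or more a.
print_r(wordsMatching('/ba{2,}ba{3,}/', ['baabaaa', 'baabaa', 'babaaar']));
// keeps: baabaaa

// (baa-){2,4}baa needs the group baa- two to four times before the final baa.
print_r(wordsMatching('/(baa-){2,4}baa/', ['baa-baa-baa', 'baa-baa', 'baa-baad-news']));
// keeps: baa-baa-baa
```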

PHP Example #120

Anchors

Anchors tie a pattern to a specific position for a more precise match. A pattern like ba(na)+ matches "banana" but also "cabana" or "bananarama": every time ba(na)+ is found anywhere in a string, the pattern matches. An anchor, however, restricts the match to the beginning or the end of the string. The anchor ^ matches the beginning of a string and the anchor $ matches the end of a string. For example, the pattern ^Gre matches strings that start with Gre, such as "Green", "Grey Lantern" or "Grep is my favorite". On the other hand, the pattern !$ matches any string that ends with an exclamation mark, such as "Zip!" or "Pow! Kablam!".

Both anchors can be used in the same pattern. The pattern ^ba(na)+ matches "banana" and "bananarama", but not "cabana", and the pattern ba(na)+$ matches "banana" and "cabana" but not "bananarama". With an anchor at each end, we get the expression ^ba(na)+$, which matches only "banana" and derivatives like "bananana" or "banananana".
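The effect of each anchor can be seen in a short sketch such as this one. The helper name isAnchoredMatch is hypothetical, and the / delimiters are added because PHP's preg_match requires them.

```php
<?php
// Hypothetical helper: true when $subject matches $pattern.
function isAnchoredMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

$words = ['banana', 'cabana', 'bananarama', 'bananana'];

foreach (['/^ba(na)+/', '/ba(na)+$/', '/^ba(na)+$/'] as $pattern) {
    // array_filter() keeps the words for which the callback returns true.
    $kept = array_filter($words, fn (string $w): bool => isAnchoredMatch($pattern, $w));
    echo $pattern, ' => ', implode(', ', $kept), PHP_EOL;
}
// /^ba(na)+/ => banana, bananarama, bananana
// /ba(na)+$/ => banana, cabana, bananana
// /^ba(na)+$/ => banana, bananana
```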

The pattern ^(w|W|b|B)illy?$ matches various names that are short forms of William: "Will", "will", "Bill", "bill", "Willy", "willy", "Billy" and "billy". Apart from the anchors ^ and $, there are anchor metacharacters that deal with word boundaries. The anchor \b matches at a word boundary, while \B matches at any position that is not a word boundary. A word boundary lies between a character that is a letter, a digit or an underscore and a character that is none of those (or the start or end of the string). For example, the pattern \b[fF]ish matches "fish" only when it starts a word, so it matches "fish", "Go fish!" and "Hamilton Fish High School", but not "bluefish", "sportfishing" or "swordfish". Still, it matches "sport-fishing", because the hyphen creates a word boundary.
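A brief sketch of the word boundary anchors follows. The helper name hasBoundaryMatch is hypothetical, and the patterns are written in single-quoted PHP strings so that the backslashes reach the regex engine unchanged.

```php
<?php
// Hypothetical helper: true when $subject contains a match for $pattern.
function hasBoundaryMatch(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// \b requires a word boundary right before the f or F.
var_dump(hasBoundaryMatch('/\b[fF]ish/', 'Go fish!'));        // bool(true)
var_dump(hasBoundaryMatch('/\b[fF]ish/', 'swordfish'));       // bool(false)
var_dump(hasBoundaryMatch('/\b[fF]ish/', 'sport-fishing'));   // bool(true)

// \B is the opposite: the f must not sit at a word boundary.
var_dump(hasBoundaryMatch('/\Bfish/', 'swordfish'));   // bool(true)
var_dump(hasBoundaryMatch('/\Bfish/', 'fish'));        // bool(false)

// The anchored name pattern accepts the short forms but not the full name.
var_dump(hasBoundaryMatch('/^(w|W|b|B)illy?$/', 'Billy'));     // bool(true)
var_dump(hasBoundaryMatch('/^(w|W|b|B)illy?$/', 'William'));   // bool(false)
```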

PHP Example #121

Classes of characters

A class of characters represents a set of characters as a single element in a regular expression. The set of characters that we want to turn into a class is enclosed within square brackets. A class of characters matches any single character of a string that is included in the class. For example, the pattern ^p[eo]pa$ matches either "pepa" or "popa", because the class [eo] matches either "e" or "o".

To place a whole range of characters inside a class, we only have to write the first and the last character of the range separated by a hyphen. For example, the pattern [a-zA-Z] matches any letter of the alphabet. When a hyphen is used to represent a range in a class, the range includes all the characters whose ASCII values lie between those of the first and last characters, both included. If we want to include a literal hyphen inside a class, we have to escape it with a backslash. So while [a-z] matches any lowercase letter of the alphabet, [a\-z] matches only "a", "-" and "z".

It is also possible to create a negated class, which matches any character that is not included in the class. To create a negated class, we start it with ^. For example, the pattern [^a-zA-Z] matches any character that is not a letter. If we want to use the character ^ as a literal inside the class, we have to escape it or else place it anywhere but the first position. For example, the pattern [0-9][%^][0-9] is the same as [0-9][\^%][0-9], a pattern that matches strings such as "5^5" or "3%2".
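The class examples above can be tried with a sketch like the following. The helper names classMatches and allClassMatches are hypothetical, and the / delimiters are added because PHP's preg functions require them.

```php
<?php
// Hypothetical helper: true when $subject contains a match for $pattern.
function classMatches(string $pattern, string $subject): bool
{
    return preg_match($pattern, $subject) === 1;
}

// Hypothetical helper: returns every match of $pattern in $subject.
function allClassMatches(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $matches);
    return $matches[0];
}

var_dump(classMatches('/^p[eo]pa$/', 'popa'));   // bool(true)
var_dump(classMatches('/^p[eo]pa$/', 'pipa'));   // bool(false)

// With the hyphen escaped, the class holds just three characters: a, - and z.
print_r(allClassMatches('/[a\-z]/', 'a-b-z'));   // a, -, -, z

// A negated class picks out everything that is not a letter.
print_r(allClassMatches('/[^a-zA-Z]/', 'R2-D2'));   // 2, -, 2

// Both spellings of the literal ^ behave the same.
var_dump(classMatches('/[0-9][%^][0-9]/', '5^5'));    // bool(true)
var_dump(classMatches('/[0-9][\^%][0-9]/', '3%2'));   // bool(true)
```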

Classes are more efficient than alternative patterns when we are choosing between individual characters. For example, instead of s(a|o|i)p we can use s[aoi]p, which gives the same results, matching "sap", "sop" and "sip". Some classes are represented by dedicated metacharacters, which are more concise than listing every character in the class. These metacharacters are shown in the table below:

| Metacharacter | Description | Equivalent class |
| --- | --- | --- |
| \d | Digits | [0-9] |
| \D | Any character that is not a digit | [^0-9] |
| \w | Word characters | [a-zA-Z0-9_] |
| \W | Any character that is not a word character | [^a-zA-Z0-9_] |
| \s | Whitespace characters | [ \t\n\r\f] |
| \S | Any character that is not whitespace | [^ \t\n\r\f] |

These metacharacters can be used as classes on their own, as in this pattern that matches a time on a 24-hour clock: ([0-1]\d|2[0-3]):[0-5]\d
They can also be included inside a class of characters together with other characters, as in this pattern that matches hexadecimal numbers: [\da-fA-F]+
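As a sketch of how these two patterns might be used for validation, the snippet below writes the time pattern with the colon directly between the hour and minute parts and wraps both patterns in ^ and $, so that the whole string has to match. The anchors and the helper names isClockTime and isHexNumber are additions made for this illustration.

```php
<?php
// Hypothetical helper: true when $text is a time such as 07:05 or 23:59.
function isClockTime(string $text): bool
{
    return preg_match('/^([0-1]\d|2[0-3]):[0-5]\d$/', $text) === 1;
}

// Hypothetical helper: true when $text consists only of hexadecimal digits.
function isHexNumber(string $text): bool
{
    return preg_match('/^[\da-fA-F]+$/', $text) === 1;
}

var_dump(isClockTime('07:05'));   // bool(true)
var_dump(isClockTime('23:59'));   // bool(true)
var_dump(isClockTime('24:00'));   // bool(false): hours stop at 23
var_dump(isClockTime('7:05'));    // bool(false): the hour needs two digits

var_dump(isHexNumber('ff00AA'));  // bool(true)
var_dump(isHexNumber('xyz'));     // bool(false)
```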