Perl Regular Expressions
Regular Expressions
Regular expressions are patterns to be matched against a string. The two basic operations performed using patterns are matching and substitution:
Matching --> /pattern/
Substitution --> s/pattern/newstring/
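As a minimal sketch of the two operations, the following applies one match and one substitution to the default variable $_. The sample string and the variable name $found are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text held in the default variable $_.
$_ = "the quick brown fox";

# Matching: /pattern/ tests $_ and is true if the pattern occurs in it.
my $found = /quick/ ? 1 : 0;

# Substitution: s/pattern/newstring/ replaces the first match in $_.
s/brown/red/;

print "found=$found\n";   # found=1
print "$_\n";             # the quick red fox
```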
The simplest kind of regular expression is a literal string. More complicated expressions include metacharacters to represent other characters or combinations of them.
The [...] construct lists a set of characters (a character class), exactly one of which must match at that position. A range of characters is written with a hyphen (-), and a circumflex (^) as the first character of the class negates it. Examples of character classes are shown below:
[a-zA-Z] --> Any single letter
[0-9] --> Any digit
[^0-9] --> Any character not a digit
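The three classes above can be tried against single characters. This is a small sketch with hypothetical inputs; each variable holds 1 if the class matched and 0 otherwise.

```perl
use strict;
use warnings;

# Each test stores 1 when the class matches and 0 when it does not.
my $letter_in_x    = "x" =~ /[a-zA-Z]/ ? 1 : 0;   # 1: "x" is a letter
my $digit_in_7     = "7" =~ /[0-9]/    ? 1 : 0;   # 1: "7" is a digit
my $nondigit_in_7  = "7" =~ /[^0-9]/   ? 1 : 0;   # 0: "7" has no non-digit
my $nondigit_in_a7 = "a7" =~ /[^0-9]/  ? 1 : 0;   # 1: the "a" is a non-digit

print "$letter_in_x $digit_in_7 $nondigit_in_7 $nondigit_in_a7\n";   # 1 1 0 1
```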
Some common character classes have their own predefined symbols:
Code | Matches |
---|---|
. | Any character except newline |
\d | A digit, same as [0-9] |
\D | A nondigit, same as [^0-9] |
\w | A word character (alphanumeric) [a-zA-Z_0-9] |
\W | A nonword character [^a-zA-Z_0-9] |
\s | A whitespace character [ \t\n\r\f] |
\S | A non-whitespace character [^ \t\n\r\f] |
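A short sketch of the predefined classes in use, assuming a hypothetical sample string. Note that by default the dot does not match a newline.

```perl
use strict;
use warnings;

my $text = "Due 09:30 sharp";   # hypothetical sample text

my $has_time    = $text =~ /\d\d:\d\d/ ? 1 : 0;   # 1: "09:30"
my $space_digit = $text =~ /\s\d/      ? 1 : 0;   # 1: the " 0" before "9:30"
my $has_nonword = $text =~ /\W/        ? 1 : 0;   # 1: a space is not a word character
my $only_spaces = "   " =~ /\S/        ? 1 : 0;   # 0: no non-whitespace present
my $dot_newline = "\n"  =~ /./         ? 1 : 0;   # 0: . does not match newline by default

print "$has_time $space_digit $has_nonword $only_spaces $dot_newline\n";   # 1 1 1 0 0
```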
Regular expressions also allow for the use of both variable interpolation and backslashed representations of certain characters:
Code | Matches |
---|---|
\n | Newline |
\r | Carriage return |
\t | Tab |
\f | Formfeed |
\/ | Literal forward slash |
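The sketch below combines variable interpolation with backslashed characters. The variables $dir and $path are hypothetical; the forward slashes are escaped because / is also the pattern delimiter.

```perl
use strict;
use warnings;

my $dir  = "usr";               # hypothetical directory name to interpolate
my $path = "/usr/local/bin";    # hypothetical path to search

# $dir is interpolated into the pattern, so this looks for "/usr/".
my $dir_found = $path =~ /\/$dir\// ? 1 : 0;              # 1

# \t inside a pattern matches a literal tab character.
my $tab_found = "name\tvalue" =~ /name\tvalue/ ? 1 : 0;   # 1

print "$dir_found $tab_found\n";   # 1 1
```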
Anchors don’t match any characters; they match places within a string.
Assertion | Meaning |
---|---|
^ | Matches at the beginning of string |
$ | Matches at the end of string, or before a newline at the end of string |
\b | Matches on word boundary |
\B | Matches except at word boundary |
\A | Matches at the beginning of string |
\Z | Matches at the end of string, or before a newline at the end of string |
\z | Matches only at the end of string |
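The differences between the anchors show up most clearly on a string that ends in a newline. The following is a sketch with hypothetical sample strings; no modifiers are used.

```perl
use strict;
use warnings;

my $line = "cat\n";   # hypothetical input ending in a newline

my $dollar  = $line =~ /cat$/  ? 1 : 0;   # 1: $ matches before the trailing newline
my $big_z   = $line =~ /cat\Z/ ? 1 : 0;   # 1: \Z also allows a trailing newline
my $small_z = $line =~ /cat\z/ ? 1 : 0;   # 0: \z requires the very end of the string

my $at_start   = "catalog"     =~ /^cat/     ? 1 : 0;   # 1: "cat" begins the string
my $whole_word = "concatenate" =~ /\bcat\b/  ? 1 : 0;   # 0: "cat" is inside a word
my $inside     = "concatenate" =~ /\Bcat\B/  ? 1 : 0;   # 1: no boundary on either side

print "$dollar $big_z $small_z $at_start $whole_word $inside\n";   # 1 1 0 1 0 1
```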
Quantifiers are used to specify how many instances of the previous element can match.
Maximal | Minimal | Allowed Range |
---|---|---|
{n,m} | {n,m}? | Must occur at least n times, but no more than m times |
{n,} | {n,}? | Must occur at least n times |
{n} | {n}? | Must match exactly n times |
* | *? | 0 or more times (same as {0,}) |
+ | +? | 1 or more times (same as {1,}) |
? | ?? | 0 or 1 time (same as {0,1}) |
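The difference between the maximal and minimal columns can be sketched as follows. The sample strings are hypothetical, and the parentheses are used here only to return the matched text in list context.

```perl
use strict;
use warnings;

my $text = "<b>bold</b>";   # hypothetical sample text

# Maximal (greedy): .+ takes as much as possible, up to the last ">".
my ($greedy) = $text =~ /(<.+>)/;     # "<b>bold</b>"

# Minimal (non-greedy): .+? stops at the first ">" that lets the match succeed.
my ($lazy) = $text =~ /(<.+?>)/;      # "<b>"

# The same contrast with a counted range.
my ($most) = "a12345" =~ /(\d{2,4})/;    # "1234"
my ($least) = "a12345" =~ /(\d{2,4}?)/;  # "12"

# With two greedy quantifiers, the leftmost one takes the most.
my ($first, $second) = "12345" =~ /(\d+)(\d+)/;   # "1234" and "5"

print "$greedy $lazy $most $least $first $second\n";
```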
Quantifiers are greedy by default: each one matches as much as it can while still allowing the overall pattern to match. If a regular expression contains two quantified patterns, the leftmost is the greediest. To make a quantifier non-greedy (minimal), append a question mark.
If you are looking for one of several possible patterns in a string, you can use the alternation operator (|). For example,
/you|me|him|her/;
will match against any one of these four words. You may also use parentheses to provide boundaries for alternation:
/And(y|rew)/;
will match either “Andy” or “Andrew”.
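A brief sketch of both alternation patterns, using hypothetical test strings. It also shows that an unanchored alternation matches anywhere in the string, so "home" matches because it contains "me".

```perl
use strict;
use warnings;

my $andy   = "Andy"   =~ /And(y|rew)/ ? 1 : 0;   # 1
my $andrew = "Andrew" =~ /And(y|rew)/ ? 1 : 0;   # 1
my $andrea = "Andrea" =~ /And(y|rew)/ ? 1 : 0;   # 0: neither "y" nor "rew" follows "And"

# Without anchors, an alternative may match inside a longer word.
my $home_loose  = "home" =~ /you|me|him|her/       ? 1 : 0;   # 1: "me" in "home"
my $home_strict = "home" =~ /\b(you|me|him|her)\b/ ? 1 : 0;   # 0: no whole-word match

print "$andy $andrew $andrea $home_loose $home_strict\n";   # 1 1 0 1 0
```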
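Parentheses are used to group characters and expressions. They also have the effect of “remembering” the part of the string matched by the enclosed subpattern for further processing. To recall a “memorized” portion within the same expression, use a backslash followed by an integer giving the position of the parentheses in the expression, counting opening parentheses from the left. The following pattern matches “fred”, any one character, “barney”, and then that same character again: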
/fred(.)barney\1/;
Outside of the expression, these “memorized” portions are accessible as the special variables $1, $2, $3, etc. Other special variables are as follows:
$& --> Part of string matching regexp
$` --> Part of string before the match
$' --> Part of string after the match
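The backreference and the special variables can be seen together in a short sketch. The sample string and the variable names are hypothetical; the special variables are copied immediately after the match because a later successful match would overwrite them.

```perl
use strict;
use warnings;

my $text = "say fredXbarneyX now";   # hypothetical sample text
my ($captured, $matched, $before, $after) = ("", "", "", "");

# \1 must repeat whatever the first pair of parentheses matched.
if ($text =~ /fred(.)barney\1/) {
    $captured = $1;    # "X"
    $matched  = $&;    # "fredXbarneyX"
    $before   = $`;    # "say "
    $after    = $';    # " now"
}

print "[$captured] [$matched] [$before] [$after]\n";
# [X] [fredXbarneyX] [say ] [ now]

# A different character after "barney" makes the backreference fail.
my $mismatch = "fredXbarneyY" =~ /fred(.)barney\1/ ? 1 : 0;   # 0
```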
Regular expression grouping precedence (highest to lowest)
Parentheses --> ( ) (?: )
Quantifiers --> ? + * {m,n} ?? +? *?
Sequence and anchoring --> abc ^ $ \A \Z (?= ) (?! )
Alternation --> |
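Two consequences of this precedence are sketched below with hypothetical test strings: alternation binds more loosely than anchors and sequences, and a quantifier applies only to the single element before it unless parentheses group more.

```perl
use strict;
use warnings;

# Alternation is lowest: /^ab|cd$/ means "starts with ab" OR "ends with cd".
my $loose  = "abXX" =~ /^ab|cd$/     ? 1 : 0;   # 1: "^ab" alone succeeds
my $strict = "abXX" =~ /^(?:ab|cd)$/ ? 1 : 0;   # 0: whole string must be ab or cd
my $exact  = "ab"   =~ /^(?:ab|cd)$/ ? 1 : 0;   # 1

# Quantifiers bind tighter than sequence: in /ab+/ the + applies to "b" only.
my $b_repeat  = "abbb" =~ /^ab+$/     ? 1 : 0;   # 1: a followed by several b
my $ab_single = "abab" =~ /^ab+$/     ? 1 : 0;   # 0: "ab" is not repeated as a unit
my $ab_group  = "abab" =~ /^(?:ab)+$/ ? 1 : 0;   # 1: grouping repeats "ab"

print "$loose $strict $exact $b_repeat $ab_single $ab_group\n";   # 1 0 1 1 0 1
```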
To select a target for matching/substitution other than the default variable ($_), use the =~ operator:
$var =~ /pattern/;
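A minimal sketch of the =~ operator with both operations, using hypothetical values. It shows that the bound variable is the one tested and modified, while $_ is left alone.

```perl
use strict;
use warnings;

my $var = "hello world";   # hypothetical target string
$_ = "untouched";          # default variable, not involved below

# Matching against $var instead of $_.
my $hit = $var =~ /world/ ? 1 : 0;   # 1

# Substitution applied to $var; it is modified in place.
$var =~ s/world/Perl/;

print "$hit\n";    # 1
print "$var\n";    # hello Perl
print "$_\n";      # untouched
```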