Perl
Quick Reference for Integration Services
Regular Expressions
Regular expressions are the standard way to match patterns in the UNIX
world. While regular expressions differ slightly among the UNIX tools,
the Perl set of regular expressions is the most powerful and complete.
Each character matches itself except for the
special characters: +?.*^$()[]{}|\
The special meaning of these characters can
be removed by preceding them with a \ (backslash).
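As a small illustration of escaping, the sketch below matches a literal price in two ways: with hand-written backslashes and with Perl's built-in quotemeta function. The sample strings are made up for the example.

```perl
use strict;
use warnings;

my $text = 'Total: $3.50 (net)';   # hypothetical sample value

# Hand-escaped: \$ and \. match a literal dollar sign and a literal dot.
my $by_hand = ($text =~ /\$3\.50/) ? 1 : 0;

# An unescaped dot matches any character, so "3x50" matches as well.
my $too_loose = ('3x50' =~ /3.50/) ? 1 : 0;

# quotemeta() backslash-escapes every non-word character in a string,
# so the result can be interpolated into a pattern as literal text.
my $literal = quotemeta('$3.50');
my $by_quotemeta = ($text =~ /$literal/) ? 1 : 0;

print "$by_hand $too_loose $by_quotemeta\n";   # prints "1 1 1"
```

quotemeta is mainly useful when the text to match comes from data rather than from the program itself.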
| . | Matches any single character (except a newline, unless the s modifier is used) |
| (...) | Groups a series of pattern elements into a single element |
| ^ | Matches the beginning of the target (of every line with the m modifier) |
| $ | Matches the end of the line |
| [...] | Denotes a class of characters to match. [^...] negates the class |
| (...|...|...) | Matches one of the alternatives |
| (?:regex) | Grouping without back-references. Used to group without storing results in variables $1..$9 |
| + | Matches the preceding pattern element one or more times |
| * | Matches the preceding pattern element zero or more times |
| ? | Matches the preceding pattern element zero or one times |
| {n,m} | Denotes the minimum n and maximum m match count. {n} means exactly n times; {n,} means at least n times; {n,m} means between n and m times |
| \w | Matches word characters, i.e. alphanumerics and _. \W matches non-word characters |
| \s | Matches whitespace. \S matches non-whitespace |
| \d | Matches digits. \D matches non-digits |
| \b | Matches word boundaries |
| \t | Tab |
| \n | Newline |
| \r | Carriage return |
| \f | Formfeed |
| \0XX | Character given in octal |
| \xXX | Character given in hex |
| $1..$9 | Refer to matched subexpressions grouped with (...) |
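To show several of these elements working together, here is a minimal sketch that pulls three fields out of one line. The input line and its layout are invented for illustration only.

```perl
use strict;
use warnings;

my $line = "ID: 00042  code=AB-7\n";   # hypothetical sample value

# \s*       optional whitespace
# (\d+)     one or more digits, captured in $1
# [A-Z]{2}  exactly two capital letters, captured in $2
# (?:-|_)   either separator, grouped without capturing
# (\d)\b    one digit followed by a word boundary, captured in $3
my ($id, $letters, $digit) = ('', '', '');
if ($line =~ /^ID:\s*(\d+)\s+code=([A-Z]{2})(?:-|_)(\d)\b/) {
    ($id, $letters, $digit) = ($1, $2, $3);
}

print "$id $letters $digit\n";   # prints "00042 AB 7"
```

Because the separator group is non-capturing, the final digit lands in $3 rather than $4.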
Regular expression modifiers:
| g | Matches globally, i.e. as many times as possible |
| i | Case-insensitive matching |
| m | Treats the string as multiple lines: ^ and $ match at the start and end of every line |
| s | Treats the string as a single line: . also matches a newline |
| x | Comments and whitespace can be added to the pattern for readability |
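The effect of each modifier is easiest to see side by side. The following is a sketch on made-up input; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $text = "Name: smith\nname: JONES\n";   # hypothetical sample value

# i: ignore case; m: ^ matches at the start of every line;
# g in list context returns the captures of all matches.
my @names = ($text =~ /^name:\s*(\w+)/gim);

# Without s the dot stops at a newline; with s it can cross lines.
my $without_s = ($text =~ /smith.name/)  ? 1 : 0;
my $with_s    = ($text =~ /smith.name/s) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored.
my $with_x = ('2024-01-15' =~ /
    ^ (\d{4})   # year
    - (\d{2})   # month
    - (\d{2})   # day
    $
/x) ? "$1|$2|$3" : '';

print "@names\n";                       # prints "smith JONES"
print "$without_s $with_s $with_x\n";   # prints "0 1 2024|01|15"
```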
Matching, Searching and Replacing, Transliterating:
[expr=~][m]/pattern/modifiers
Searches expr (default: $_) for the pattern and returns true or false
depending on whether or not the pattern matched. If you prepend the m
you can use almost any pair of delimiters instead of the slashes.
This is useful if the pattern contains lots of slashes, since it
avoids having to escape them all.
The most common alternative delimiters are {}, [], and ##.
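A short sketch of the difference alternative delimiters make, using a made-up file path as the target:

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical sample value

# With slashes as delimiters, every / in the pattern must be escaped.
my $escaped = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# With m{...} the same pattern needs no escaping.
my $braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# Without an explicit expression the match is applied to $_.
my $default = 0;
for ($path) {                        # aliases $_ to $path
    $default = 1 if m#/bin/perl#;
}

print "$escaped $braces $default\n";   # prints "1 1 1"
```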
[$var=~]s/pattern/newtext/modifiers
Searches the string var (default: $_) for a pattern, and if
found, replaces that part with the replacement text. It returns the
number of substitutions made. Almost any delimiter may replace the
slashes. If bracketing delimiters are used, pattern and newtext may
each have their own pair of delimiters, e.g., s(foo)[bar]
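The return value and the bracketing delimiters can be seen in a minimal sketch like this one (the sample string is invented):

```perl
use strict;
use warnings;

my $record = 'foo/foo/baz';   # hypothetical sample value

# Bracketing delimiters as in s(foo)[bar]; g replaces every occurrence.
my $count = ($record =~ s(foo)[bar]g);

# When nothing matches, s/// returns a false value and leaves the string alone.
my $changed = ($record =~ s/qux/x/) ? 1 : 0;

print "$record $count $changed\n";   # prints "bar/bar/baz 2 0"
```

Testing the return value is a simple way to tell whether a substitution actually changed anything.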
[$var=~]tr/searchlist/replacementlist/modifiers
Replaces all occurrences of the characters found in the search list
with the corresponding characters in the replacement list. The d
modifier deletes all characters found in the search list that do not
have a corresponding character in the replacement list.
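A brief sketch of transliteration on a made-up value. The counting idiom in the last step relies on tr returning the number of characters it matched, which is standard Perl behavior but not described above.

```perl
use strict;
use warnings;

my $code = 'abc-123-xyz';   # hypothetical sample value

# Ranges map position by position: a-z becomes A-Z.
(my $upper = $code) =~ tr/a-z/A-Z/;

# With d, characters in the search list that have no counterpart in the
# replacement list are deleted: here every "-" is removed.
(my $compact = $code) =~ tr/-//d;

# With an empty replacement list and no d modifier nothing is changed,
# so tr can be used simply to count matching characters.
my $digits = ($code =~ tr/0-9//);

print "$upper $compact $digits\n";   # prints "ABC-123-XYZ abc123xyz 3"
```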
Examples:
Please send me (Jamin) more examples of
regexes you come up with or examples you’d like me to come up with.
In each example the string that contains the
data we’re interested in is always in $_.
Example 1:
A physician name is in the format:
LAST, FIRST MIDDLE
And we’d like to extract each part of the
name.
if (/(\w+),\s*(\w+)\s+(\w+)/) {
    $last   = $1;
    $first  = $2;
    $middle = $3;
}
So we’re capturing a word followed by a comma, followed by some
optional whitespace, then we’re capturing another word followed by some
whitespace and then we’re capturing one more word.
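The same extraction can be wrapped in a small subroutine. This is only a sketch: parse_name is a hypothetical helper, and making the middle name optional with (?:\s+(\w+))? is an assumption that goes beyond the pattern above.

```perl
use strict;
use warnings;

# Hypothetical helper: splits "LAST, FIRST MIDDLE" into its parts.
# Assumption: the middle name may be missing, so it sits in an optional group.
sub parse_name {
    my ($name) = @_;
    if ($name =~ /^\s*(\w+),\s*(\w+)(?:\s+(\w+))?/) {
        return ($1, $2, defined $3 ? $3 : '');
    }
    return;   # empty list when the format does not match
}

my @full  = parse_name('SMITH, JOHN QUINCY');
my @short = parse_name('DOE, JANE');

print join('|', @full), "\n";    # prints "SMITH|JOHN|QUINCY"
print join('|', @short), "\n";   # prints "DOE|JANE|"
```

Note that \w does not match hyphens or apostrophes, so names such as O'BRIEN would need a wider character class.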
Example 2:
Let’s say instead of extracting
each name into a variable we’d like to just reformat:
LAST, FIRST MIDDLE
to
LAST^FIRST^MIDDLE
tr/ ,/^/d;
That will transliterate spaces
into ^ and will delete commas since we added the d modifier.
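Here is the transliteration applied to a sample value, next to a possible alternative using s///. The alternative is an illustration rather than part of the original example; it tolerates irregular spacing because a whole run of whitespace becomes a single ^.

```perl
use strict;
use warnings;

my $name = 'SMITH, JOHN QUINCY';   # hypothetical sample value

# The transliteration from the example: space becomes ^, comma is deleted.
(my $by_tr = $name) =~ tr/ ,/^/d;

# Alternative sketch: a comma with optional whitespace, or any run of
# whitespace, is replaced by one ^.
(my $by_s = 'SMITH,JOHN   QUINCY') =~ s/,\s*|\s+/^/g;

print "$by_tr\n";   # prints "SMITH^JOHN^QUINCY"
print "$by_s\n";    # prints "SMITH^JOHN^QUINCY"
```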
Example 3:
A small program to mark up code for posting
to the web. Escapes characters that
have special meaning in HTML
#!/usr/bin/perl -w
print "<pre><code>\n";
while (<>) {
    s/&/&amp;/g;   # must come first so the entities below are not re-escaped
    s/</&lt;/g;
    s/>/&gt;/g;
    print;
}
print "</code></pre>\n";
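For use inside a larger script, the escaping step could be factored into a subroutine. The sketch below assumes a hypothetical helper named escape_html; it replaces &, < and > with the entities &amp;, &lt; and &gt;.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the text with &, < and >
# replaced by the corresponding HTML entities.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # first, or the & of &lt; and &gt; would be escaped again
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

my $escaped = escape_html('if (a < b && b > c)');
print "$escaped\n";   # prints "if (a &lt; b &amp;&amp; b &gt; c)"
```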
Quick Examples:
Keep the first five characters of a string:
$first = substr($_, 0, 5);
Keep the last five characters of a string:
$last = substr($_, -5);
Keep all characters up to the first ^:
s/\^.*//s;
Remove leading zeros:
s/^0*//;
Remove leading spaces:
s/^ *//;
Remove trailing spaces:
s/ *$//;
Reformat a phone number (123) 456-7890 to 1234567890:
tr/() -//d;
Reformat a phone number 1234567890 to (123) 456-7890:
s/(\d{3})(\d{3})(\d{4})/($1) $2-$3/;
Reformat SSN 123-45-6789 to 123456789:
tr/-//d;
Reformat SSN 123456789 to 123-45-6789:
s/(\d{3})(\d{2})(\d{4})/$1-$2-$3/;
Keep only alpha chars:
s/[^A-Za-z]//g;
Keep only numeric chars:
s/\D//g;
Left Pad number with zeros to 10 places:
$_ = sprintf("%010d", $_);
Right Pad string with spaces to 10 places:
$_ = sprintf("%-10s", $_);
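Several of the quick examples are combined below on made-up values, as a sketch that works on named variables instead of $_. Two details are assumptions added here: the hyphen is placed last in the tr search list so it cannot be read as a range, and the leading-zero variant uses a lookahead so that a lone "0" is not emptied.

```perl
use strict;
use warnings;

my $phone = '(123) 456-7890';   # hypothetical sample value

# Strip punctuation; the hyphen comes last so it is taken literally.
(my $digits = $phone) =~ tr/() -//d;

# Rebuild the formatted number from the ten digits.
(my $pretty = $digits) =~ s/^(\d{3})(\d{3})(\d{4})$/($1) $2-$3/;

# Remove leading zeros but always keep at least one digit.
(my $stripped = '000123') =~ s/^0+(?=\d)//;

my $padded = sprintf('%010d', 42);      # left pad with zeros to 10 places
my $left   = sprintf('%-10s', 'abc');   # right pad with spaces to 10 places

# Keep everything before the first ^.
(my $field = 'A^B^C') =~ s/\^.*//s;

print "$digits $pretty $stripped $padded $field\n";
# prints "1234567890 (123) 456-7890 123 0000000042 A"
```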