Perl Quick Reference for Integration Services

 

Regular Expressions

 

Regular expressions are the standard way to match patterns in text in the UNIX world.  The syntax differs slightly among the UNIX tools; Perl's regular expressions are among the most powerful and complete.

 

Each character matches itself except for the special characters: +?.*^$()[]{}|\

To match one of these characters literally, remove its special meaning by preceding it with a backslash (\).
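As a small sketch of escaping in practice: a single special character can be escaped by hand, while a whole string of unknown content can be escaped with Perl's built-in quotemeta function or the \Q...\E sequence inside a pattern.  The sample strings below are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical literal text that happens to contain special characters.
my $literal = 'price (USD) is $5.00+';
my $text    = 'The price (USD) is $5.00+ tax';

# quotemeta() returns a copy with a backslash before every non-word character.
my $escaped = quotemeta($literal);

# \Q...\E applies the same escaping inline, inside the pattern.
my $found_inline  = ($text =~ /\Q$literal\E/) ? 1 : 0;
my $found_escaped = ($text =~ /$escaped/)     ? 1 : 0;

# Escaping one character by hand: \. matches only a literal dot.
my $has_dot = ('5.00' =~ /\./) ? 1 : 0;
my $no_dot  = ('500'  =~ /\./) ? 1 : 0;

print "inline=$found_inline escaped=$found_escaped has_dot=$has_dot no_dot=$no_dot\n";
```

Without the escaping, the parentheses, $, . and + in the sample string would all be read as pattern syntax and the match would fail.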

 

.              Matches any single character except a newline (any character at all under the s modifier)
(...)          Groups a series of pattern elements into a single element and captures what it matched
^              Matches the beginning of the string (with the m modifier, the beginning of each line)
$              Matches the end of the string, or just before a final newline (with the m modifier, the end of each line)
[...]          Denotes a class of characters to match.  [^...] negates the class
(...|...|...)  Matches one of the alternatives
(?:...)        Groups without capturing: nothing is stored in the variables $1..$9
+              Matches the preceding pattern element one or more times
*              Matches the preceding pattern element zero or more times
?              Matches the preceding pattern element zero or one time
{n,m}          Sets the match count for the preceding element.  {n} means exactly n times; {n,} means at least n times; {n,m} means between n and m times
\w             Matches a word character, i.e. alphanumeric or _.  \W matches a non-word character
\s             Matches whitespace.  \S matches non-whitespace
\d             Matches a digit.  \D matches a non-digit
\b             Matches a word boundary
\t             Tab
\n             Newline
\r             Carriage return
\f             Formfeed
\0XX           Character with octal code XX
\xXX           Character with hex code XX
$1..$9         Refer to the subexpressions matched by (...) groups, numbered by opening parenthesis
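The following sketch combines several of the elements above in one place.  The log line and the variable names are made up for the illustration; they are not part of any real interface.

```perl
use strict;
use warnings;

# Hypothetical log line, invented for this illustration.
my $line = "2024-03-15 ERROR disk full on /dev/sda1\n";

my ($year, $month, $day, $level, $state);
if ($line =~ /^(\d{4})-(\d{2})-(\d{2})\s+(INFO|WARN|ERROR)\b\s+(?:disk|memory)\s+(\w+)/) {
    # (?:disk|memory) groups the alternatives without using a capture
    # variable, so the word after it lands in $5 rather than $6.
    ($year, $month, $day, $level, $state) = ($1, $2, $3, $4, $5);
}

# A character class plus $: the run of / and word characters at the end.
# $ matches just before the trailing newline.
my ($device) = $line =~ m{([/\w]+)$};

# . does not match a newline unless the s modifier is given.
my $dot_matches_newline = ("a\nb" =~ /a.b/) ? 1 : 0;

print "$year-$month-$day level=$level state=$state device=$device dot=$dot_matches_newline\n";
```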

 

Regular expression modifiers:

 

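g    Matches globally: finds every occurrence instead of stopping at the first
i    Case-insensitive matching
m    Treats the string as multiple lines: ^ and $ match at the start and end of each line
s    Treats the string as a single line: . also matches a newline
x    Ignores whitespace in the pattern and allows # comments, for readability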
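A short sketch of what each modifier changes, using an invented three-line string.  The comments state the assumption behind each result.

```perl
use strict;
use warnings;

# Hypothetical multi-line input.
my $text = "Name: smith\nname: Jones\nNAME: brown\n";

# g in list context returns every match, i ignores case, and
# m lets ^ match at the start of each line.
my @names = $text =~ /^name:\s*(\w+)/gim;

# Without m, ^ matches only at the very start of the string.
my @first_only = $text =~ /^name:\s*(\w+)/gi;

# s lets . match a newline, so .* can run across lines.
my $spans_lines = ($text =~ /smith.*brown/s) ? 1 : 0;
my $one_line    = ($text =~ /smith.*brown/)  ? 1 : 0;

# x allows whitespace and comments inside the pattern.
my $date_ok = ('2024-03-15' =~ /
    ^ (\d{4})   # year
    - (\d{2})   # month
    - (\d{2})   # day
    $
/x) ? 1 : 0;

print "names=@names first_only=@first_only spans=$spans_lines one_line=$one_line date_ok=$date_ok\n";
```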

 

Matching, Searching and Replacing, Transliterating:

 

[expr=~][m]/pattern/modifiers

 

Searches expr (default: $_) for the pattern and returns true or false depending on whether or not it matched.  If you prepend the m, you can use almost any pair of delimiters instead of the slashes.  This is useful when the pattern contains lots of slashes, because it avoids having to escape them all.  The most common alternative delimiters are {}, [], and ##.
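A minimal sketch of the match operator with and without alternative delimiters.  The paths are invented sample data.

```perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';

# With slashes as delimiters, every / in the pattern must be escaped.
my $with_slashes = ($path =~ /^\/usr\/local\//) ? 1 : 0;

# With m{...}, the same pattern needs no escaping.
my $with_braces = ($path =~ m{^/usr/local/}) ? 1 : 0;

# In list context the match returns the captured groups.
my ($dir, $file) = $path =~ m{^(.*)/([^/]+)$};

# With no =~, the match is applied to $_.
my $under_usr = 0;
for ('/usr/bin/env', '/tmp/scratch', '/usr/lib') {
    $under_usr++ if m#^/usr/#;
}

print "slashes=$with_slashes braces=$with_braces dir=$dir file=$file under_usr=$under_usr\n";
```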

 

[$var=~]s/pattern/newtext/modifiers

 

Searches the string $var (default: $_) for the pattern and, if it is found, replaces the matched part with the replacement text.  Without the g modifier only the first match is replaced.  It returns the number of substitutions made, or false if there were none.  Almost any delimiter may replace the slashes.  If bracketing delimiters are used, pattern and newtext may each have their own pair, e.g.,     s(foo)[bar]
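A brief sketch of the substitution operator's return value and of bracketing delimiters.  All strings are made-up sample data.

```perl
use strict;
use warnings;

my $text = 'foo, foo and more foo';

# With g, every occurrence is replaced and the count is returned.
my $count = ($text =~ s/foo/bar/g);

# Without g, only the first occurrence is replaced.
(my $first_only = 'foo foo') =~ s/foo/bar/;

# Bracketing delimiters: the pattern and the replacement each get their own pair.
my $path = '/home/alice/docs';
(my $moved = $path) =~ s{^/home/}[/Users/];

# When nothing matches, the result is false and the string is left alone.
my $changed = ($text =~ s/xyz/abc/) ? 1 : 0;

print "count=$count text=$text first_only=$first_only moved=$moved changed=$changed\n";
```

The (my $copy = $original) =~ s/.../.../ form copies the string first, so the original is left untouched.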

 

[$var=~]tr/searchlist/replacementlist/modifiers

 

Replaces every occurrence of a character found in the search list with the corresponding character in the replacement list, and returns the number of characters matched.  The d modifier deletes all characters found in the search list that do not have a corresponding character in the replacement list.
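A small sketch of tr, showing the effect of the d modifier side by side with the default behavior.  The sample name is invented.

```perl
use strict;
use warnings;

my $name = 'smith, john q';

# Character ranges: map lower case to upper case.
(my $upper = $name) =~ tr/a-z/A-Z/;

# With an empty replacement list and no d, tr only counts; $name is unchanged.
my $commas = ($name =~ tr/,//);

# With d, the comma has no counterpart in the replacement list and is deleted.
(my $with_d = $name) =~ tr/ ,/_/d;

# Without d, the last replacement character is reused for the comma.
(my $without_d = $name) =~ tr/ ,/_/;

print "upper=$upper commas=$commas with_d=$with_d without_d=$without_d\n";
```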

 

 

Examples:

 

Please send me (Jamin) any regex examples you come up with, or requests for examples you’d like me to write.

 

In each example the string that contains the data we’re interested in is always in $_. 

 

Example 1:

 

A physician name is in the format:

 

LAST, FIRST MIDDLE

 

And we’d like to extract each part of the name.

 

if (/(\w+),\s*(\w+)\s+(\w+)/) {

      $last   = $1;

      $first  = $2;

      $middle = $3;

}

 

So we’re capturing a word followed by a comma and some optional whitespace, then another word followed by some whitespace, and finally one more word.
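The pattern above only matches when a middle name is present.  As a hedged variant, the sketch below wraps the same idea in a hypothetical parse_name function (the name and the return convention are assumptions, not part of the original) and makes the middle name optional with a non-capturing group.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (last, first, middle), or an empty list
# when the input is not in LAST, FIRST [MIDDLE] form.
sub parse_name {
    my ($name) = @_;
    if ($name =~ /^\s*(\w+),\s*(\w+)(?:\s+(\w+))?\s*$/) {
        # $3 is undefined when there is no middle name.
        return ($1, $2, defined $3 ? $3 : '');
    }
    return;
}

my @full    = parse_name('SMITH, JOHN QUINCY');
my @short   = parse_name('SMITH, JOHN');
my @invalid = parse_name('no comma here');

print join('|', @full), "\n";
print join('|', @short), "\n";
print scalar(@invalid), "\n";
```

Since \w does not match hyphens or apostrophes, names such as O'BRIEN or SMITH-JONES would need a wider character class.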

 

Example 2:

 

Let’s say instead of extracting each name into a variable we’d like to just reformat:

 

LAST, FIRST MIDDLE

 

to

 

LAST^FIRST^MIDDLE

 

tr/ ,/^/d;

 

That will transliterate spaces into ^.  The comma has no counterpart in the replacement list, so the d modifier deletes it.
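The tr approach maps characters one for one, so it assumes exactly one space between the parts.  The sketch below shows that behavior on invented sample data, along with a substitution that collapses any run of commas and whitespace into a single ^.  The substitution is an alternative suggested here, not part of the original example.

```perl
use strict;
use warnings;

my $name = 'SMITH, JOHN QUINCY';
(my $by_tr = $name) =~ tr/ ,/^/d;

# With extra spaces, tr produces one ^ per space.
my $spaced = 'SMITH,  JOHN   QUINCY';
(my $tr_spaced = $spaced) =~ tr/ ,/^/d;

# A substitution can treat each run of separators as one.
(my $by_s = $spaced) =~ s/[,\s]+/^/g;

print "$by_tr\n$tr_spaced\n$by_s\n";
```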

 

Example 3:

 

A small program to mark up code for posting to the web.  It escapes the characters that have special meaning in HTML.

 

#!/usr/bin/perl -w

 

print "<pre><code>\n";

while (<>) {

      s/&/&amp;/g;

      s/</&lt;/g;

      s/>/&gt;/g;

      print;

}

print "</code></pre>\n";
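The order of the three substitutions matters: & has to be escaped first, otherwise the & in a freshly inserted &lt; or &gt; is escaped a second time.  The sketch below illustrates this with two hypothetical helper functions (names invented here) that apply the same substitutions in different orders.

```perl
use strict;
use warnings;

# Same three substitutions as the program above, & first.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Deliberately wrong order, for comparison: & is escaped last.
sub escape_html_wrong_order {
    my ($text) = @_;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/&/&amp;/g;
    return $text;
}

my $code = 'a < b && c > d';
print escape_html($code), "\n";
print escape_html_wrong_order($code), "\n";
```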

 

Quick Examples:

 

Keep the first five characters of a string:

$first = substr($_, 0, 5);

 

Keep the last five characters of a string:

$last = substr($_, -5);

 

Keep all characters up to the first ^:

s/\^.*//;
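One thing to watch when writing a pattern for this task: .* is greedy, so a pattern such as (.*)\^ runs up to the last ^ in the string, not the first.  The sketch below compares that behavior with a few ways of keeping only the text before the first ^, on an invented sample field.

```perl
use strict;
use warnings;

my $field = 'SMITH^JOHN^QUINCY';

# Greedy .* grabs everything up to the LAST ^, so only that one ^ is removed.
(my $greedy = $field) =~ s/(.*)\^/$1/;

# Deleting from the first ^ onward keeps just the leading part.
(my $first = $field) =~ s/\^.*//;

# Capturing a run of non-^ characters from the start does the same job.
my ($captured) = $field =~ /^([^^]*)/;

# The same result without a regex, using index and substr.
my $pos       = index($field, '^');
my $by_substr = $pos >= 0 ? substr($field, 0, $pos) : $field;

print "greedy=$greedy first=$first captured=$captured by_substr=$by_substr\n";
```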

 

Remove leading zeros:

s/^0*//;

 

Remove leading spaces:

s/^ *//;

 

Remove trailing spaces:

s/ *$//;

 

Reformat a phone number (123) 456-7890 to 1234567890:

tr/() -//d;

 

Reformat a phone number 1234567890 to (123) 456-7890:

s/(\d{3})(\d{3})(\d{4})/($1) $2-$3/;
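The two phone number examples can be combined into one normalizing step.  The sketch below is an assumption about how that might look: format_phone is a hypothetical helper that strips every non-digit, checks that exactly ten digits remain, and then applies the substitution above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "(123) 456-7890" or undef if the
# input does not contain exactly ten digits.
sub format_phone {
    my ($raw) = @_;
    (my $digits = $raw) =~ s/\D//g;          # keep only the digits
    return undef unless length($digits) == 10;
    $digits =~ s/^(\d{3})(\d{3})(\d{4})$/($1) $2-$3/;
    return $digits;
}

for my $input ('1234567890', '123.456.7890', '(123) 456-7890', '12345') {
    my $formatted = format_phone($input);
    print "$input => ", (defined $formatted ? $formatted : 'invalid'), "\n";
}
```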

 

Reformat SSN 123-45-6789 to 123456789:

tr/-//d;

 

Reformat SSN 123456789 to 123-45-6789:

s/(\d{3})(\d{2})(\d{4})/$1-$2-$3/;

 

Keep only alpha chars:

s/[^A-Za-z]//g;

 

Keep only numeric chars:

s/\D//g;

 

Left-pad a number with zeros to 10 places:

$_ = sprintf("%010d", $_);

 

Right-pad a string with spaces to 10 places:

$_ = sprintf("%-10s", $_);
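A short sketch of the two padding formats on invented values.  Note that the width in a sprintf format is a minimum, so a longer string is not cut off; adding a precision, as in %-10.10s, pads and truncates in one step.

```perl
use strict;
use warnings;

# Zero-padded to a width of 10.
my $padded_num = sprintf("%010d", 42);

# Left-justified and space-padded to a width of 10.
my $padded_str = sprintf("%-10s", 'abc');

# The width is a minimum: a longer string comes through whole.
my $too_long = sprintf("%-10s", 'abcdefghijklmnop');

# A precision of 10 also truncates to 10 characters.
my $fixed = sprintf("%-10.10s", 'abcdefghijklmnop');

print "[$padded_num] [$padded_str] [$too_long] [$fixed]\n";
```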