Perl
Please find below my highly synthetic notes on “Learning Perl” (7th, 2017, O’Reilly Media) by Randal L. Schwartz, brian d foy, and Tom Phoenix, the famous Llama book [1].
- Introduction
- Scalar Data
- Lists and Arrays
- Subroutines
- Input and Output
- Hashes
- Regular Expressions
- Matching with Regular Expressions
- Processing Text with Regular Expressions
- More Control Structures
- Perl Modules
- File Tests
- Directory Operations
- Strings and Sorting
- Process Management
- Some Advanced Perl Techniques
1. Introduction
2. Scalar Data
3. Lists and Arrays
4. Subroutines
5. Input and Output
Input from Standard Input
Obtain next line of input in scalar context:
Line-input operator <STDIN> returns undef when it reaches end-of-file. Use this to drop out of loops:
Line-input operator in list context results in all of remaining lines of input as list:
It is best to use line-input operator in a scalar context to read input line-by-line. In list context Perl fetches all input at once.
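A minimal sketch of both contexts. To stay self-contained it reads from an in-memory filehandle opened on the hypothetical string `$text`; with real input you would write `<STDIN>` in place of `<$in>`.

```perl
use strict;
use warnings;

# Hypothetical sample input; an in-memory filehandle stands in for STDIN.
my $text = "alpha\nbeta\ngamma\n";
open my $in, '<', \$text or die "Cannot open in-memory handle: $!";

# Scalar context: one line per call.
my $first = <$in>;
chomp $first;

# List context: all remaining lines at once.
chomp(my @rest = <$in>);
close $in;

# Scalar context in a loop: ends when the operator returns undef at end-of-file.
open $in, '<', \$text or die "Cannot reopen in-memory handle: $!";
my $count = 0;
while (defined(my $line = <$in>)) {
    $count++;
}
close $in;

print "first=$first rest=@rest count=$count\n";
```

The loop form keeps only one line in memory at a time, which is why it is preferred for large inputs.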
Input from the Diamond Operator
Diamond operator <> is a special kind of line-input operator. Input comes from the user’s choice of files named on the command line, or from the keyboard (standard input) when no files are given.
Current file name is kept in Perl’s special variable $ARGV. This name is “-” if input comes from the standard input stream STDIN.
The Double Diamond
Double diamond operator supports special characters in the filename, e.g. |. Using double diamond will avoid performing a “pipe open” and running an external program.
The Invocation Arguments
Perl stores invocation arguments in a special array @ARGV.
Use it as any other array to:
- shift items,
- iterate over it with foreach,
- check if any arguments start with a hyphen.
Use modules Getopt::Long and Getopt::Std to process options in a standard way.
Tinker with the array @ARGV after the program start and before the diamond <> invocation:
Output to Standard Output
Printing array results in a list of items, with no spaces in between:
Interpolating array prints contents of an array separated by spaces:
Default separator is a space character. The separator is stored in a special variable called $".
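A short sketch of the difference, using a hypothetical array `@rocks`. The strings are built with `join` and interpolation so they can be inspected before printing.

```perl
use strict;
use warnings;

my @rocks = qw(flint slate rubble);

my $plain  = join '', @rocks;   # same text that `print @rocks;` emits by default
my $spaced = "@rocks";          # interpolation joins the items with $"

my $dashed;
{
    local $" = '-';             # change the separator inside this block only
    $dashed = "@rocks";
}

print "$plain\n$spaced\n$dashed\n";
```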
Other examples include:
Perl power tools project implements all classic Unix utilities in Perl => Makes these standard utilities available on non-Unix systems [2].
Formatted Output with printf
Template string in printf holds multiple conversions (percent sign and letter):
Common conversions are:
%g | number in floating point, integer or exponential notation (chosen automatically) |
%d | decimal integer |
%x | hexadecimal |
%o | octal |
%s | string |
%f | floating-point with round off |
%% | literal percent sign |
Automatic choice of floating point, integer and exponential notation:
Decimal integer conversion truncates the number:
Hexadecimal and octal conversions are:
String conversion with width specification is:
Floating-point conversion with round off is:
Asterisk * inside format string uses its argument as width:
Use two asterisks to specify total width and number of decimal places in a float:
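The conversions above can be tried out with the sketch below. It uses sprintf, which accepts the same format strings as printf but returns the text; the sample values are illustrative.

```perl
use strict;
use warnings;

my $auto    = sprintf '%g %g %g', 5/2, 51/17, 1234567890; # notation chosen automatically
my $trunc   = sprintf '%d', 17.85;          # fractional part is dropped, not rounded
my $hex_oct = sprintf '%x %o', 17, 17;      # hexadecimal and octal
my $right   = sprintf '%10s', 'wilma';      # right-justified in 10 columns
my $left    = sprintf '%-10s|', 'wilma';    # negative width left-justifies
my $rounded = sprintf '%12.3f', 6 * 7 + 2/3;    # three decimal places, rounded
my $star    = sprintf '%*s', 10, 'wilma';       # width taken from the argument list
my $stars   = sprintf '%*.*f', 6, 2, 3.1415926; # total width 6, two decimal places

print "[$_]\n" for $auto, $trunc, $hex_oct, $right, $left, $rounded, $star, $stars;
```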
See sprintf documentation for more options.
Arrays and printf
To generate a format string on the fly store it in a variable first:
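One way to do this, sketched with a hypothetical `@items` array: the repetition operator `x` produces one `%10s` conversion per element.

```perl
use strict;
use warnings;

my @items = qw(wilma dino pebbles);

# One "%10s\n" per array element; @items is used in scalar context as the count.
my $format = "The items are:\n" . ("%10s\n" x @items);

my $report = sprintf $format, @items;
print $report;
```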
Filehandles
Filehandle names an I/O connection between a Perl process and the outside world.
Filehandle is not necessarily a filename.
The recommended convention is to use all uppercase letters in the name of a bareword filehandle.
Six special filehandle names reserved by Perl are:
- STDIN
- STDOUT
- STDERR
- DATA
- ARGV
- ARGVOUT
If a user calls a Perl script as:
Inside Perl script files ‘in.txt’, ‘out.txt’ and ‘err.txt’ will be available as STDIN, STDOUT and STDERR filehandles.
Opening a Filehandle
Use Perl’s open operator to create a custom connection:
Three-argument version of open is a safer option:
Specify an encoding along with the mode:
Binmoding Filehandles
Turn off processing of line ending:
Specify a layer to ensure filehandles know about intended encodings:
Bad Filehandles
Operator open returns true, if it succeeded or false otherwise:
Closing a Filehandle
Fatal Errors with die
Operator die terminates the program:
Special variable $! contains a human-readable error message.
die automatically appends program name and line number, where it failed.
Always check the status of open => The rest of the program relies upon it.
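A minimal sketch of the check. The path is hypothetical and assumed not to exist, and the eval block is there only so that the example keeps running after die fires.

```perl
use strict;
use warnings;

my $path = '/hypothetical/missing/dir/logfile';   # assumed not to exist

my $ok = eval {
    # $! carries the system's reason for the failure.
    open my $log, '>>', $path or die "Cannot open $path: $!";
    close $log;
    1;
};

# Without a trailing newline, die appends the program name and line number.
print "open failed: $@" unless $ok;
```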
Warning Messages with warn
Use warn function to issue a warning and proceed with the code.
Automatically die-ing
Pragma autodie automatically calls die, if a file open fails:
Using Filehandles
Open a filehandle and read from it:
Use filehandle open for writing or appending:
There is no comma between a filehandle and values to print.
Changing the Default Output Filehandle
Functions print and printf output into STDOUT. Operator select changes the default behaviour:
Gentlemen set it back to STDOUT when they are done:
Reopening a Standard Filehandle
Send errors to a custom error log:
Output with say
Built-in say is the same as print, but it puts a newline at the end:
To interpolate an array quote it:
Specify a filehandle with say:
Filehandles in a Scalar
Use scalar in place of a bareword:
Combine two statements (declaring and opening):
Once filehandle is in a scalar, use it similarly to a bareword:
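A sketch of the whole round trip with lexical filehandles. The scratch file comes from the core File::Temp module and exists only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file for this sketch; it is removed when the program exits.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Declare and open in one statement; the handle lives in the lexical $out.
open my $out, '>', $filename or die "Cannot write to $filename: $!";
print {$out} "fred\n";     # braces mark $out as the filehandle
print $out "barney\n";     # also valid: no comma after the handle
close $out;

open my $in, '<', $filename or die "Cannot read $filename: $!";
chomp(my @names = <$in>);
close $in;

print "@names\n";
```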
6. Hashes
What Is a Hash?
Hash is a data structure that lets you look up values by name. Hash indices are called keys. They aren’t numbers, but arbitrary, unique strings.
There is neither a fixed order in a hash, nor a first element.
Hash is a collection of key-value pairs.
Hash keys are always unique. They are converted to strings. Same value can be stored more than once.
Why Use a Hash?
Use hash when one set of data “is related” to another set of data.
Hash Element Access
To access an element of a hash use this syntax:
Use curly braces instead of square brackets around the key. Key expression is now a string, not a number:
When choosing a hash name, think of the word “for” between the name of the hash and the key.
Hash elements spring into existence when you first assign to them:
This feature is called autovivification.
Accessing an element outside a hash returns undef.
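A short sketch of element access, using a hypothetical `%family_name` hash (read it as "family name for fred").

```perl
use strict;
use warnings;

my %family_name;

# Elements are created by the first assignment to them.
$family_name{'fred'}   = 'flintstone';
$family_name{'barney'} = 'rubble';

# The key may be any expression; it is used as a string.
my $person = 'fred';
my $name   = $family_name{$person};

# A key that was never stored yields undef.
my $missing = $family_name{'dino'};

print "$person => $name\n";
print "dino is not in the hash\n" unless defined $missing;
```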
The Hash as a Whole
Refer to the entire hash with a percent sign (%) as a prefix.
It is possible to convert a hash into a list and back again.
Assigning to a hash is a list-context assignment (list is key-value pairs):
Value of hash in a list context is a simple list of key-value pairs:
Turning the hash back into a list of key-value pairs is called unwinding in Perl.
Use a hash either when items order is not important or when you have an easy way to control their order.
But key-value pairs in a hash stay together, i.e. value will follow its key.
Hash Assignment
To copy a hash simply assign one hash to another:
This is a computationally expensive operation in Perl: first an %old_hash is unwound into a list, then it is assigned to a %new_hash one key-value pair at a time.
To invert a hash write:
Perl uses the rule: “last one wins”. Later items in the list overwrite earlier ones.
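A sketch of copying and inverting, with hypothetical host data. Inverting is only reliable when the values are unique, because of the "last one wins" rule.

```perl
use strict;
use warnings;

my %ip_address = (localhost => '127.0.0.1', router => '192.168.1.1');

# Copy: the hash is unwound into a list and assigned pair by pair.
my %copy = %ip_address;

# Invert: reverse turns (key, value, ...) into (value, key, ...).
my %host_name = reverse %ip_address;

# Last one wins: the later pair overwrites the earlier one.
my %last = (key => 'first', key => 'second');

print "$host_name{'127.0.0.1'} $last{key}\n";
```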
The Big Arrow
In Perl grammar any comma , can be written as a big arrow => (fat comma).
Alternative way to set up a hash of last names is:
Perl shortcut: it’s possible to omit the quote marks on some hash keys, when you use a fat comma:
Use this shortcut in curly braces of a hash element reference:
Hash Functions
The keys and values Functions
The keys function yields a list of all keys in a hash, values function returns the corresponding values:
Use these functions in a scalar context to retrieve the number of elements in a hash:
Use hash in a Boolean context to find out, if a hash is not empty:
The each function
The each function returns a key-value pair as a two element list. It is a common way to iterate over a hash. Use this function in a while loop:
To go through the hash in order, sort the keys:
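Both iteration styles, sketched with a hypothetical `%books` hash:

```perl
use strict;
use warnings;

my %books = (fred => 3, wilma => 1, barney => 0);

# each: pairs arrive in no particular order.
my %seen;
while (my ($key, $value) = each %books) {
    $seen{$key} = $value;
}

# Sorted keys: a predictable order.
my @ordered;
foreach my $person (sort keys %books) {
    push @ordered, "$person => $books{$person}";
}

print "$_\n" for @ordered;
```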
Typical Use of a Hash
Library database to keep track of how many books each person has checked out is a good example of a hash use:
See, whether an element of the hash is true or false:
The exists Function
The exists function returns true, iff the given key exists in the hash:
The delete Function
The delete function removes the given key-value pair from the hash:
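The difference between a false value and a missing key can be sketched as follows; the `%books` data is hypothetical.

```perl
use strict;
use warnings;

my %books = (fred => 3, wilma => 1, barney => 0);

# exists asks about the key; the value (0 here) does not matter.
my $barney_has_card  = exists $books{barney} ? 1 : 0;
my $barney_has_books = $books{barney}        ? 1 : 0;

# delete removes the key and its value together.
delete $books{wilma};
my $wilma_has_card = exists $books{wilma} ? 1 : 0;

print "card=$barney_has_card books=$barney_has_books wilma=$wilma_has_card\n";
```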
Hash Element Interpolation
Single hash element can be interpolated into a double-quoted string:
There is no support for entire hash interpolation, e.g. %books. Beware of the magical characters ($, @, ", \, ', [, {, ->, ::) that need back-slashing.
The %ENV Hash
Perl program environment is stored in the %ENV hash:
In Perl, dollar sign ‘$’ means there is one of something, at sign ‘@’ means there is a list of something and percent sign ‘%’ means there is an entire hash.
7. Regular Expressions
Mastering Regular Expressions (3rd, 2006, O’Reilly Media) by Jeffrey Friedl [3]
Watch regexes with Regexp::Debugger
Know your character classes under different semantics
Number to match | Metacharacter | Generalised form |
---|---|---|
Optional | ? | {0,1} |
Zero or more | * | {0,} |
One or more | + | {1,} |
Minimum with no maximum | | {3,} |
Minimum with maximum | | {3,5} |
Exactly | | {3} |
Shortcut | Matches | Note |
---|---|---|
\d | decimal digit | |
\D | not a decimal digit | |
\s | whitespace | |
\S | not whitespace | |
\h | horizontal whitespace | (v5.10 and later) |
\H | not horizontal whitespace | (v5.10 and later) |
\v | vertical whitespace | (v5.10 and later) |
\V | not vertical whitespace | (v5.10 and later) |
\R | generalised line ending | (v5.10 and later) |
\w | “word” character | |
\W | not a “word” character | |
\n | newline | (not really a shortcut) |
\N | non-newline | (stable in v5.18) |
8. Matching with Regular Expressions
Expression | Note |
---|---|
m/pattern/s | match any character, even newline |
m/pattern/i | case insensitive matching |
m/pattern/x | make whitespace inside pattern insignificant |
m/pattern/m | multiline matching: ^ and $ also match at the start and end of each line |
m/\Apattern/ | beginning of string |
m/pattern\Z/ | end of string (or just before a final newline) |
m/\b{wb}/ | word boundary |
m/\b{sb}/ | sentence boundary |
m/\b{lb}/ | line boundary |
m/(pat)tern/ | capture group between (), available as $1 |
m/(pat)tern \1/ | capture group, reuse capture group in matching expression |
m/(?:pat)tern/ | non-capturing parentheses, do not save result into $1 |
m/(?<LABEL>pattern)/ | named capture, stored in %+, available as $+{LABEL} |
m/(?<LABEL>pattern) \g{LABEL}/ | named capture, reused in the matching expression with \g{LABEL} or \k<LABEL> |
'Brave new world!' =~ m/new/ | automatic match variables: $` is 'Brave ' (before the match), $& is 'new' (the match), $' is ' world!' (after the match) |
Regular expression feature | Example |
---|---|
Parentheses (grouping or capturing) | (...), (?:...), (?<LABEL>...) |
Quantifiers | a* a+ a? a{n,m} |
Anchors and sequence | abc ^ $ \A \b \z \Z |
Alternation | a\|b\|c |
Atoms | a [abc] \d \1 \g{2} |
Pattern test program
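A minimal sketch of such a program. It loops over a hypothetical list of sample strings instead of reading `<>`, and `/new/` stands in for the pattern under test. The match variables show exactly which part of each string matched.

```perl
use strict;
use warnings;

my @samples = ('Brave new world!', 'no hit here');   # hypothetical input lines
my @report;

foreach (@samples) {
    if (/new/) {                        # replace with the pattern to test
        # $` = text before the match, $& = the match, $' = text after it
        push @report, "Matched: |$`<$&>$'|";
    } else {
        push @report, "No match: |$_|";
    }
}

print "$_\n" for @report;
```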
9. Processing Text with Regular Expressions
Expression | Note |
---|---|
s/before/after/ | substitution |
s/before/after/g | global replacement |
s/(pattern)/\U$1/ | turn the captured text to upper case with \U: PATTERN |
s/(pattern)/\L$1/ | turn the captured text to lower case with \L: pattern |
s/(pattern)/\u\L$1/ | turn the captured text to title case with \u\L: Pattern |
s/(pattern)/\l\U$1/ | turn the captured text to inverse title case with \l\U: pATTERN |
lc, uc, fc, lcfirst, ucfirst | case shifting functions |
my @fields = split /separator/, $string; | break up a string according to a pattern |
my $result = join $glue, @pieces; | glues together a bunch of pieces to make a single string |
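A few of the operations from the table, sketched on hypothetical sample strings. The `(my $copy = $text) =~ s/.../.../` idiom edits a copy and leaves the original untouched.

```perl
use strict;
use warnings;

my $text = 'i saw barney with fred';

# Case shifting inside the replacement; /g replaces every occurrence.
(my $upper = $text) =~ s/(fred|barney)/\U$1/g;
(my $title = $text) =~ s/(fred|barney)/\u\L$1/g;

# split keeps empty fields in the middle of the string.
my @fields = split /:/, 'abc:def::g:h';

# join is the opposite operation; the glue is a plain string, not a pattern.
my $joined = join '-', @fields;

print "$upper\n$title\n$joined\n";
```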
Tom Christiansen on parsing HTML with regular expressions
Metacharacter | Number to match |
---|---|
?? | Zero or one, preferring zero |
*? | Zero or more, as few as possible |
+? | One or more, as few as possible |
{3,}? | At least three, but as few as possible |
{3,5}? | At least three, as many as five, but as few as possible |
{3}? | Exactly three |
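The effect of a non-greedy quantifier is easiest to see side by side. This sketch strips hypothetical `<b>` tags from a sample string:

```perl
use strict;
use warnings;

my $text = '<b>Fred</b> and <b>Wilma</b>';

# Greedy: .* runs to the last </b>, so only the outermost tags are removed.
(my $greedy = $text) =~ s{<b>(.*)</b>}{$1}g;

# Non-greedy: .*? stops at the first </b>, so each pair is handled separately.
(my $lazy = $text) =~ s{<b>(.*?)</b>}{$1}g;

print "$greedy\n$lazy\n";
```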
Levels of regular expression compliance in Unicode Technical Report #18
Updating many files
Equivalent one-liner:
10. More Control Structures
Control structures
Statement modifiers
Examples
Naked block provides scope for temporary lexical variables:
elsif clause
Autoincrement and autodecrement
Value of autoincrement
Infinite loop
Loop controls
Conditional (ternary) operator
Logical operators
Defined-or (since v5.10)
11. Perl Modules
Finding modules
Check if module is installed by trying to read its documentation
Get details on a module with the cpan command
Installing modules
If the module uses ExtUtils::MakeMaker install new modules with
Modules based on Module::Build should be built and installed with
To use CPAN.pm for module installation and all of its dependencies issue
Another (user-friendly) alternative is to install modules with the cpan script (it comes with Perl)
Finally, there is cpanm (cpanminus) that is designed as a zero-configuration, lightweight CPAN client.
Using custom installation directories
Use local::lib to set environment variables affecting CPAN module installation. View default settings with
Call cpan with -I switch to respect local::lib settings
To make your Perl program aware of modules in custom locations issue
Using Simple Modules
Using Only Some Functions from a Module
Call functions by their full names
File::Spec Module
Path::Class Module
Databases and DBI
Programming the Perl DBI (2000, O’Reilly Media) by Tim Bunce and Alligator Descartes [4] DBI website
Example of using the DBI module:
Dates and Times
Example of using the Time::Moment module:
12. File Tests
File test operators
To get the list of all the file test operators type:
Processing a list of files:
File test | Meaning |
---|---|
-r | File or directory is readable by this (effective) user or group |
-w | File or directory is writable by this (effective) user or group |
-x | File or directory is executable by this (effective) user or group |
-o | File or directory is owned by this (effective) user |
-R | File or directory is readable by this real user or group |
-W | File or directory is writable by this real user or group |
-X | File or directory is executable by this real user or group |
-O | File or directory is owned by this real user |
-e | File or directory name exists |
-z | File exists and has zero size (always false for directories) |
-s | File exists and has nonzero size (the value is the size in bytes) |
-f | Entry is a plain file |
-d | Entry is a directory |
-l | Entry is a symbolic link |
-S | Entry is a socket |
-p | Entry is a named pipe (a “fifo”) |
-b | Entry is a block-special file (like a mountable disk) |
-c | Entry is a character-special file (like an I/O device) |
-u | File or directory is setuid |
-g | File or directory is setgid |
-k | File or directory has a sticky bit set |
-t | The filehandle is a TTY (as reported by the isatty() system function; filenames can't be tested by this test) |
-T | File looks like a “text” file |
-B | File looks like a “binary” file |
-M | Modification age (measured in days) |
-A | Access age (measured in days) |
-C | Inode-modification age (measured in days) |
Using default filename stored in $_
Testing Several Attributes of the Same File
Calling system’s stat each time:
Re-using information from last file lookup:
Stacked File Test Operators (Starting with Perl 5.10)
The stat and lstat Functions
Return value of a call to stat is a 13-element list:
- $dev and $inode
- device number and inode number of the file
- $mode
- set of permission bits for the file and some other bits
- $nlink
- number of (hard) links to the file or directory
- $uid and $gid
- numeric user-ID and group-ID showing file’s ownership
- $size
- size in bytes as returned by the -s file test
- $atime, $mtime, $ctime
- three timestamps (access, modification, inode change), each given as the number of seconds since the epoch
Use lstat function to obtain information on symbolic links.
File::stat module provides a friendlier interface to stat.
The localtime function
Function localtime in a scalar context converts a timestamp into a human-readable date-time string:
localtime in a list context:
- $mon
- a month number [0, 11]
- $year
- number of years since 1900
- $wday
- weekday from Sunday to Saturday [0, 6]
- $yday
- day-of-the-year from Jan 1 through Dec 31 [0, 364(5)]
Function gmtime returns time in Universal Time.
Function time returns current timestamp from system clock.
Bitwise Operators
Expression | Meaning |
---|---|
10 & 12 | Bitwise-and: which bits are true in both operands (this gives 8) |
10 \| 12 | Bitwise-or: which bits are true in one operand or the other (this gives 14) |
10 ^ 12 | Bitwise-xor: which bits are true in one operand or the other but not in both (this gives 6) |
6 << 2 | Bitwise shift left: shift the left operand the number of bits shown by the right operand, adding zero-bits at the least-significant places (this gives 24) |
25 >> 2 | Bitwise shift right: shift the left operand the number of bits shown by the right operand, discarding the least-significant bits (this gives 6) |
~10 | Bitwise negation, also called unary bit complement: returns the number with the opposite bit for each bit in the operand (this gives 0xFFFFFFF5 with 32-bit integers, 0xFFFFFFFFFFFFFFF5 with 64-bit integers) |
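The table values can be checked with a short sketch. The negation is masked to its low 32 bits so that the result is the same on 32-bit and 64-bit builds.

```perl
use strict;
use warnings;

my @results = (10 & 12, 10 | 12, 10 ^ 12, 6 << 2, 25 >> 2);

# ~10 depends on the integer width of this perl; keep the low 32 bits only.
my $low32 = ~10 & 0xFFFFFFFF;

print "@results\n";
printf "0x%X\n", $low32;
```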
Using Bitstrings
Since Perl 5.22 the bitwise feature separates the two kinds of operation: |, &, ^ and ~ always treat their operands as numbers, while |., &., ^. and ~. always treat them as strings:
Results in
13. Directory Operations
The current working directory
Obtain current working directory using the Cwd module:
Use the File::Spec module to convert between relative and absolute paths.
Changing the directory
Change directory with chdir:
Module File::HomeDir helps to set and get the environment variables for chdir.
Globbing
Use the glob operator to expand a pattern into the matching filenames:
An alternate syntax for globbing
Legacy Perl code uses angle-bracket syntax:
Directory handles
Obtain list of filenames from a directory using a directory handle:
Alternatively use a bareword directory handle:
Compared to globbing this is a lower-level operation with more manual work. The list includes all files, not just those matching a given pattern. Implement a skip-over function to obtain the necessary files:
Filename returned by the readdir operator has no pathname component! It’s only the name within the directory. Patch up the name to get the full name:
To improve portability use File::Spec::Functions module to construct the path:
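A sketch that combines the directory handle, the skip-over logic and portable path construction. The scratch directory and the file names are hypothetical and exist only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec::Functions qw(catfile);

# Scratch directory with a few empty files; removed at program exit.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.txt notes.log)) {
    open my $fh, '>', catfile($dir, $name) or die "Cannot create $name: $!";
    close $fh;
}

opendir my $dh, $dir or die "Cannot open $dir: $!";
my @txt_paths;
while (defined(my $name = readdir $dh)) {
    next if $name =~ /^\./;              # skip ., .. and hidden entries
    next unless $name =~ /\.txt\z/;      # keep only the files we want
    push @txt_paths, catfile($dir, $name);   # readdir returns bare names
}
closedir $dh;

print "$_\n" for sort @txt_paths;
```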
Removing files
Use the unlink operator to remove files:
Combine unlink with glob to delete multiple files:
Return value of a successful call to unlink is a number of files deleted:
To know which unlink operation failed, process files in a loop:
It is possible to remove files that you can’t read, you can’t write, you can’t execute and you don’t even own (chmod 0).
Renaming files
Give a new name or move a file with the rename function:
Moving files to another disk partition with rename is not possible.
To batch rename a list of files:
Links and files
Create hard and soft links to files with:
To test whether a file is a symbolic link use:
To find out where a symbolic link is pointing, use the readlink function:
Making directories
Make a directory:
Second parameter is the initial permission setting in octal. Make sure to write it with a leading zero or use the oct function:
Use extra call to oct function when the value comes from user input:
Removing directories
To remove an empty directory use the rmdir function. It removes only one directory per call:
Example of writing many temporary files during the execution of a program:
Check out File::Temp module for creating temporary directories or files and remove_tree function provided by the File::Path module.
Modifying permissions
Use chmod function to change permissions on a file or directory:
Symbolic permissions with ugoa and rwxXst are not supported by the chmod function. Use File::chmod module to enable symbolic mode values in chmod.
Changing ownership
Change the ownership and group membership of a list of files:
Apply helper functions getpwnam and getgrnam to convert user and group names into numbers:
Changing timestamps
Use utime function to update the access and modification time of a list of files:
14. Strings and Sorting
Finding a substring with index
To locate first occurrence of substring in a string use the index operator:
The character position returned by index is a zero-based value. If the substring was not found index returns -1.
The third parameter to index specifies, where to start searching for a given substring. By default index searches from the beginning of the string:
Operator rindex finds the last occurrence of the substring (i.e. scans from the end of the string).
The position returned by rindex is still counted from the left, from the beginning of the string.
Manipulating a substring with substr
Operator substr extracts a part of the string:
The $length parameter may be omitted, if the end of string is required. Initial position $start can be negative, counting from the end of the string. In this case -1 denotes the last character of the string.
Functions index and substr work well together:
Change a given portion of the string with substr and assignment:
Giving a length of 0 inserts text without removing anything:
Use the binding operator (=~) to restrict an operation to work with a substring:
Alternatively use a four argument version of substr, where the fourth argument is the replacement string:
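The index and substr operations of this section, sketched on two hypothetical sample strings. Positions are zero-based.

```perl
use strict;
use warnings;

my $stuff = 'Howdy world!';

my $first_w = index($stuff, 'w');                # first occurrence
my $next_w  = index($stuff, 'w', $first_w + 1);  # search resumes after it
my $last_o  = rindex($stuff, 'o');               # last occurrence, counted from the left
my $none    = index($stuff, 'xyz');              # -1 when not found

my $word = substr($stuff, $next_w, 5);           # five characters from that position
my $tail = substr($stuff, -6);                   # negative start counts from the end

# substr as an lvalue replaces part of the string in place.
my $string = 'Hello, world!';
substr($string, 0, 5) = 'Goodbye';

# Four-argument form: returns the old text and stores the replacement.
my $previous = substr($string, 0, 7, 'Hello');

print "$first_w $next_w $last_o $none $word $tail $previous $string\n";
```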
Formatting data with sprintf
Function sprintf returns the requested string instead of printing it:
Using sprintf with “money numbers”
Format a number with two digits after the decimal point:
To insert commas separating thousands in a number use the following subroutine:
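One common way to write such a subroutine is sketched below. The name `big_money` and the dollar prefix are illustrative choices.

```perl
use strict;
use warnings;

sub big_money {
    my $number = sprintf '%.2f', shift;
    # Each pass inserts one comma before the rightmost unseparated group
    # of three digits; the loop ends when no substitution is possible.
    1 while $number =~ s/^(-?\d+)(\d\d\d)/$1,$2/;
    # Put the dollar sign after an optional leading minus sign.
    $number =~ s/^(-?)/$1\$/;
    return $number;
}

print big_money(12345678.9), "\n";
print big_money(-1234.5), "\n";
```

The `1 while ...` form is a statement modifier: the substitution itself does the work and the `1` is a placeholder body.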
Use modules Number::Format and CLDR::Number for pre-defined operations with numbers.
Advanced sorting
Write a custom comparison statement to specify the sorting order:
To apply custom sorting routine:
Three-way comparison for numbers is used frequently. Perl’s spaceship operator provides a shortcut for it:
Similarly the cmp operator defines a three-way comparison for strings:
To sort Unicode strings apply:
When the sorting routines are simple, use them “inline”:
The reversed order sorting may be obtained either with the reverse keyword:
or by swapping the operands:
Sorting a hash by value
Imagine bowling scores of three characters are stored in a hash:
Enable numeric comparison on the scores, rather than the names:
Sorting by multiple keys
Consider a fourth entry in the scores hash:
If the players have the same score, sort their entries by name:
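A sketch of the two-level sort with hypothetical bowling scores. The low-precedence `or` falls through to the name comparison only when the scores are equal, because `<=>` then returns 0.

```perl
use strict;
use warnings;

my %score = (barney => 195, fred => 205, dino => 30, 'bamm-bamm' => 195);

my @winners = sort {
    $score{$b} <=> $score{$a}    # highest score first
        or
    $a cmp $b                    # ties broken alphabetically by name
} keys %score;

print "$_: $score{$_}\n" for @winners;
```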
Example of a library program using a five-level sort:
15. Process Management
The system function
Launch a child process with:
Use single quotes, if Perl interpolation is not needed:
Use shell’s facility to launch a background process (Perl will not wait for it to finish):
Avoiding the shell
Invoke the system operator with multiple arguments to avoid the shell:
For security reasons choose a multi-argument call to system (vs a single-argument call).
The system operator returns 0 on success, the reverse of the usual “true is good” convention:
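A sketch of a multi-argument call and its return value. To stay portable it launches the running Perl interpreter itself (`$^X`) as the child process; any other command and argument list would work the same way.

```perl
use strict;
use warnings;

# Multi-argument form: no shell is involved, arguments are passed as-is.
my $status = system $^X, '-e', 'exit 0';
print "child succeeded\n" if $status == 0;

# On failure the exit code of the child sits in the high byte of the status.
my $failed    = system $^X, '-e', 'exit 3';
my $exit_code = $failed >> 8;
print "child exited with code $exit_code\n";
```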
The environment variables
Modify environment variables to be inherited by child processes:
Rely on the Config module to use a path separator native for the operating system:
The exec function
The exec function causes the main Perl process to perform the requested action:
There is no main Perl process to return to after the exec function finishes.
Using backquotes to capture output
Place a command between backquotes to save its output:
Example: invoking the perldoc command repeatedly for a set of functions:
Avoid using backquotes, when output capture is not needed.
Using backquotes in a list context
Get the data automatically broken up by lines:
Use the result of a system command in a list context:
External processes with IPC::System::Simple
This module provides simpler interface compared to Perl’s built-in system utility:
To capture output replace the system command with capture:
Processes as filehandles
Using processes as filehandles provides the only easy way to write to a process based on the results of computation.
Launch a concurrent (parallel) child process with the piped open command:
The three-argument form is:
Read filehandle normally to obtain data:
Print with filehandle to send data:
Close filehandle to finish sending data:
If you read from the process while no data is available, your program is suspended until the sending program speaks again.
For reading, backquotes are easier to manage, unless you want to have the results as they come in.
Example of the find command printing results as they are found:
Getting down and dirty with fork
It is possible to access low-level process management system calls directly:
Re-implementation of system 'date':
Sending and receiving signals
Send the “interrupt signal” SIGINT to a process with a known ID:
Check, if process is still alive:
Assign into the special %SIG hash to activate the signal handler:
Example of a custom signal handler [1] (chapter 15, exercise 4, pp. 273, 324, 325):
16. Some Advanced Perl Techniques
Slices
Imagine the following list:
Extract a few elements using conventional arrays:
Assign the result of split to a list of scalars:
Use undef to ignore corresponding elements of the source list:
Extreme use case to extract mtime value from stat:
Instead, index into a list as if it were an array (with a list slice):
Use list slices to pull out items from the initial example:
List slice in a list context (merge two operations together):
Pull the first and last items from a list:
Sorting a whole list just to find its extremes is wasteful; the List::Util module offers more efficient alternatives.
Slice subscripts may be in any order and may repeat indices:
Array Slice
Parentheses may be omitted, when slicing elements from an array:
Interpolate a slice directly into a string:
Update selected elements of the array:
Hash Slice
Pull values with a list of hash keys or with a slice:
Slice is always a list. Hence, the hash slice notation uses an at sign.
Elegant assignment using hash slices:
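The three kinds of slice side by side, sketched with hypothetical data. In each case the `@` sigil signals that the result is a list.

```perl
use strict;
use warnings;

my $line = 'fred flintstone:2168:301 Cobblestone Way:555-1212:555-2121:3';

# List slice: index straight into the list returned by split.
my ($card_num, $count) = (split /:/, $line)[1, 5];

# Array slice: no parentheses needed, and it interpolates into strings.
my @names  = qw(zero one two three four);
my @picked = @names[2, 4];
my $shown  = "@names[0, 1]";

# Hash slice: assign several values at once, then read them back.
my %score;
my @players = qw(barney fred dino);
@score{@players} = (195, 205, 30);
my @three = @score{qw(barney fred dino)};

print "$card_num $count | @picked | $shown | @three\n";
```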
Key-Value Slices
Since v5.20 it is possible to extract key-value pairs with a key-value slice:
Sigils do not denote variable type, they communicate what you do with the variable. Key-value pairs is a hashy sort of operation, hence there is % in front of it.
Trapping Errors
Using eval
Wrap code in an eval block to trap fatal errors:
In case of an error the eval block stops running, but the program doesn’t crash.
More Advanced Error Handling
In basic Perl, you may throw an exception with die and catch it with eval:
Inspect value of $@ to figure out what went wrong.
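A minimal sketch of trapping both a built-in and a custom error. The variable names are illustrative. `$@` is copied right after each eval, since later code may overwrite it.

```perl
use strict;
use warnings;

my ($total, $count) = (10, 0);

# Division by zero is fatal, but inside eval it only ends the block.
my $average = eval { $total / $count };
my $error   = $@;
print "An error occurred: $error" if $error;

# Throw your own exception with die; the trailing 1 marks success.
my $ok = eval {
    die "custom failure\n";
    1;
};
my $custom = $@;
print "Caught: $custom" unless $ok;
```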
Dynamic scope of $@ may cause problems. Use module Try::Tiny from CPAN for better error handling.
Try::Tiny puts the error message into $_ to prevent abuse of $@.
Picking Items from a List with grep
Perl’s grep operator acts as a filter:
In scalar context grep tells the number of items selected:
Transforming Items from a List with map
Use map operator to change every item in a list:
Instead of returning a Boolean value as grep, map generates a list of values.
Simpler syntax of map:
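A sketch of both operators on hypothetical numeric lists. Inside the block, `$_` is aliased to each item in turn.

```perl
use strict;
use warnings;

# grep keeps the items for which the block is true.
my @odd       = grep { $_ % 2 } 1 .. 10;
my $odd_count = grep { $_ % 2 } 1 .. 10;    # scalar context: how many matched

# map collects whatever the block evaluates to.
my @formatted = map { sprintf '%.2f', $_ } (17.5, 3);

# Expression form: no block, a comma after the expression.
my @doubled = map $_ * 2, 1 .. 3;

print "@odd | $odd_count | @formatted | @doubled\n";
```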
Fancier List Utilities
List::Util module from Standard Library enables high performance list processing utilities:
First occurrence:
Sum:
Maximum numeric and textual:
Use shuffle to randomise order of elements in a list:
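The List::Util functions named above, sketched on hypothetical data. The module ships with Perl, so no installation is needed.

```perl
use strict;
use warnings;
use List::Util qw(first sum max maxstr shuffle);

my @numbers = (3, 10, 7);

my $first_big = first { $_ > 5 } @numbers;   # stops at the first match
my $total     = sum @numbers;
my $largest   = max @numbers;                # numeric comparison
my $last_name = maxstr qw(fred wilma barney);    # string comparison
my @shuffled  = shuffle 1 .. 10;             # same items, random order

print "$first_big $total $largest $last_name\n";
print "@shuffled\n";
```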
Use List::MoreUtils module for more advanced subroutines.
Match a condition with none, any, all:
Process n items at a time with natatime:
Combine two or more lists interweaving the elements with mesh:
References
- R. L. Schwartz, brian d foy, and T. Phoenix, Learning Perl, 7th ed. O’Reilly Media, 2017.
- “Perl power tools project, official website.” [Online]. Available at: https://perlpowertools.com
- J. Friedl, Mastering Regular Expressions, 3rd ed. O’Reilly Media, 2006.
- T. Bunce and A. Descartes, Programming the Perl DBI, 1st ed. O’Reilly Media, 2000.