=head1 TITLE
Exegesis 5: Pattern Matching
=head1 AUTHOR
Damian Conway <damian@conway.org>
=head1 VERSION
Maintainer: Larry Wall <larry@wall.org>
Date: 22 Aug 2002
Last Modified: 29 May 2006
Number: 5
Version: 2
[Update: Please note that this was written several years ago, and
a number of things have changed since then. Rather than changing
the original document, we'll be inserting "Update" notes like this
one to tell you where the design has since evolved. (For the better,
we hope). In any event, for the latest Perl 6 design (or to figure out
any cryptic remarks below) you should read the Synopses, which are kept
very much more up-to-date than either the Apocalypses or Exegeses.]
=over
=item I<Come gather round Mongers, whatever you code>
=item I<And admit that your forehead's about to explode>
=item I<'Cos Perl patterns induce complete brain overload>
=item I<If there's source code you should be maintainin'>
=item I<Then you better start learnin' Perl 6 patterns soon>
=item I<For the regexes, they are a-changin'>
=over
=item -- Bob Dylan, "The regexes they are a-changin'" (Perl 6 remix)
=back
=back
Apocalypse 5 marks a significant departure in the on-going design of Perl 6.
Previous Apocalypses took an evolutionary approach to changing Perl's
general syntax, data structures, control mechanisms, and operators. New
features were added, old features removed, existing features were
enhanced, extended, and simplified. But the changes described were
remedial, not radical.
Larry could have taken the same approach with regular expressions. He could
have tweaked some of the syntax, added new C<(?...)> constructs, cleaned
up the rougher edges, and moved on.
Fortunately, however, he's taking a much broader view of Perl's future
than that. And he saw that the problem with regular expressions was I<not>
that they lacked a C<(?$var:...)> extension to do named captures, or
that they needed a C<\R> metatoken to denote a recursive subpattern,
or that there was a C<[:YourNamedCharClassHere:]> mechanism missing.
He saw that those features, laudable as they were individually, would
just compound the real problem. Which was that Perl 5
regular expressions were already groaning under the accumulated weight
of their own metasyntax. And that a decade of accretion had left the
once-clean notation arcane, baroque, inconsistent, and obscure.
It was time to throw away the prototype.
Even more importantly, as powerful as Perl 5 regexes are, they are not nearly
powerful enough. Modern text manipulation is predominantly about processing
structured, hierarchical text. And that's just plain painful with regular
expressions. The advent of modules like Parse::Yapp and Parse::RecDescent
reflects the community's widespread need for more sophisticated parsing
mechanisms. Mechanisms that should be native to Perl.
As Piers Cawley has so eloquently misquoted: I<"It is a truth universally
acknowledged that any language in possession of a rich syntax must be in
want of a rewrite."> Perl regexes are such a language. And Apocalypse 5
is precisely that rewrite.
=head1 What's the diff?
So let's take a look at some of those new features. To do that, we'll
consider a series of examples structured around a common theme:
recognizing and manipulating data in the Unix
L<diff|http://www.gnu.org/manual/diffutils-2.8.1/html_node/Detailed-Normal.html> format.
A classic diff consists of zero-or-more text transformations, each of
which is known as a "hunk". A hunk consists of a modification specifier,
followed by one or more lines of context. Each hunk is either an append,
a delete, or a change, and the type of hunk is specified by a single
letter (C<'a'>, C<'d'>, or C<'c'>). Each of these single-letter specifiers is
prefixed by the line numbers of the lines in the original document it
affects, and followed by the equivalent line numbers in the transformed
file. The context information consists of the lines of the original file
(each preceded by a C<< '<' >> character), then the lines of the
transformed file (each preceded by a C<< '>' >>). Deletes omit the
transformed context, appends omit the original context. If both contexts
appear, they are separated by a line consisting of three hyphens.
Phew! You can see why natural language isn't the preferred way of
specifying data formats.
The preferred way is, of course, to specify such formats as patterns.
And, indeed, we could easily throw together a few Perl 6 patterns that
collectively would match any data conforming to that format:
$file = rx/ ^ <$hunk>* $ /;
$hunk = rx :i {
[ <$linenum> a :: <$linerange> \n
<$appendline>+
|
<$linerange> d :: <$linenum> \n
<$deleteline>+
|
<$linerange> c :: <$linerange> \n
<$deleteline>+
--- \n
<$appendline>+
]
|
(\N*) ::: { fail "Invalid diff hunk: $1" }
[Update: Backrefs are now numbered starting at 0.]
};
$linerange = rx/ <$linenum> , <$linenum>
| <$linenum>
/;
$linenum = rx/ \d+ /;
$deleteline = rx/^^ \< <sp> (\N* \n) /;
$appendline = rx/^^ \> <sp> (\N* \n) /;
# and later...
my $text is from($*ARGS);
print "Valid diff"
if $text =~ /<$file>/;
=head1 Starting gently
There's a lot of new syntax there, so let's step through it slowly,
starting with:
$file = rx/ ^ <$hunk>* $ /;
This statement creates a pattern object. Or, as it's known in Perl 6, a
"rule". People will probably still call them "regular expressions" or
"regexes" too (and the keyword C<rx> reflects that), but Perl patterns
long ago ceased being anything like "regular", so we'll try and avoid
those terms.
[Update: We've resurrected the term "regex" to refer to these patterns
in general. When we say "rule" now, we're specifically referring to the
kind of regex that you would use in a grammar. See S05.]
In any case, the C<rx> constructor builds a new rule, which is then
stored in the C<$file> variable. The Perl 5 equivalent would be:
# Perl 5
my $file = qr/ ^ (??{$hunk})* $ /x;
Which illustrates quite nicely why the entire syntax
needed to change.
The name of the rule constructor has changed from C<qr> to C<rx>,
because in Perl 6 rule constructors I<aren't> quotelike contexts.
In particular, variables don't interpolate into C<rx> constructors
in the way they do for a C<qq> or a C<qx>. That's why we can embed the
C<$hunk> variable before it's actually initialized.
In Perl 6, an embedded variable becomes part of the rule's implementation
rather than part of its "source code". As we'll see shortly, the pattern
itself can determine how the variable is treated (i.e. whether to
interpolate it literally, or treat it as a subpattern, or use it as a
container).
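For comparison, here is a minimal Perl 5 sketch of the same deferred lookup, using the C<(??{...})> construct from the equivalent shown earlier. The simplified C<$hunk> pattern (a bare hunk header such as C<1a2>) is invented purely for illustration. The point is only that C<$file> can be compiled while C<$hunk> is still undefined.

```perl
use strict;
use warnings;

# Package variable, so the regex code block sees its current value at match time.
our $hunk;

# Compile $file first: (??{ $hunk }) is not evaluated until a match is attempted.
my $file = qr/ \A (?: (??{ $hunk }) )* \z /x;

# Only now give $hunk a value (an illustrative stand-in for the real rule).
$hunk = qr/ \d+ [acd] \d+ \n /x;

my $valid   = "1a2\n3d4\n"    =~ $file ? 1 : 0;   # two well-formed headers
my $invalid = "1a2\nbogus\n" =~ $file ? 1 : 0;   # second line is not a header

print "valid=$valid invalid=$invalid\n";
```

The Perl 6 version gets this late binding for every embedded variable, without any special syntax.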
=head1 Lay it out for me
In Perl 6 every rule implicitly has the equivalent of the Perl 5 C</x> modifier
turned on, so we could lay out (and annotate) that first pattern like this:
$file = rx/ ^ # Must be at start of string
<$hunk> # Match what the rule in $hunk would match...
* # ...zero-or-more times
$ # Must be at end of string (no newline allowed)
/;
Because C</x> is the default, the whitespace in the pattern is ignored,
which allows us to lay the rule out more readably. Comments are also honoured,
which enables us to document the rule sensibly. You can even use the closing
delimiter in a comment safely:
$caveat = rx/ Make \s+ sure \s+ to \s+ ask
\s+ (mum|mom) # handle UK/US spelling
\s+ (and|or) # handle and/or
\s+ dad \s+ first
/;
Of course, the examples in this Exegesis I<don't> represent good
comments in general, since they document what is happening,
rather than why.
The meanings of the C<^> and C<*> metacharacters are unchanged
from Perl 5. However, the meaning of the C<$> metacharacter I<has>
changed slightly: it no longer allows an optional newline before the end
of the string. If you want that behaviour, you need to specify it
explicitly. For example, to match a line ending in digits: C</ \d+ \n? $/>
The compensation is that, in Perl 6, a C<\n> in a pattern matches a I<logical>
newline (that is, any of: C<"\015\012"> or C<"\012"> or C<"\015">
or C<"\x85"> or C<"\x2028">), rather than just a
I<physical> ASCII newline (i.e. just C<"\012">). And a C<\n> will always
try to match any kind of physical newline marker (not just the current system's
favorite), so it correctly matches against strings that have been
aggregated from multiple systems.
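The following Perl 5 sketch contrasts the two behaviours just described. It assumes Perl 5.10 or later, and uses C<\z> and C<\R> as the nearest Perl 5 counterparts of the Perl 6 C<$> and C<\n>. That correspondence is only approximate: C<\R> also accepts a few line breaks that are not in the list above.

```perl
use strict;
use warnings;

# Perl 5's $ tolerates one trailing newline; \z (like Perl 6's $) does not.
my $dollar   = "42\n" =~ / \d+ $ /x      ? 1 : 0;   # matches
my $strict   = "42\n" =~ / \d+ \z /x     ? 1 : 0;   # does not match
my $explicit = "42\n" =~ / \d+ \n? \z /x ? 1 : 0;   # matches: newline made explicit

# The logical newlines listed above, written as Perl 5 string escapes.
my @endings = ("\015\012", "\012", "\015", "\x{85}", "\x{2028}");

# \R is a generic line break; Perl 5's \n is only the physical "\012".
my $matched_by_R = grep { "line$_" =~ / \A line \R \z /x } @endings;
my $matched_by_n = grep { "line$_" =~ / \A line \n \z /x } @endings;

print "$dollar $strict $explicit $matched_by_R $matched_by_n\n";
```

All five endings satisfy C<\R>, whereas the Perl 5 C<\n> accepts only one of them.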
=head1 Interpolate ye not...
The really new bit in the C<$file> rule is the C<< <$hunk> >> element.
It's a directive to grab whatever's in the C<$hunk> variable (presumably
another pattern) and attempt to match it at that point in the rule. The
important point is that the contents of C<$hunk> are only grabbed when
the pattern matching mechanism actually needs to match against them,
I<not> when the rule is being constructed. So it's like the mysterious
C<(??{...})> construct in Perl 5 regexes.
The angle brackets themselves are a much more general mechanism in Perl 6 rules.
They are the "metasyntactic markers" and replace the Perl 5 C<(?...)> syntax.
They are used to specify numerous other features of Perl 6 rules, many of which
we will explore below.
Note that if we I<hadn't> put the variable in angle-brackets, and had
just written:
rx/ ^ $hunk* $ /;
then the contents of C<$hunk> would I<still> not be interpolated when
the pattern was parsed. Once again, the pattern would grab the
contents of the variable when it reached that point in its match.
But, this time, without the angle brackets around C<$hunk>, the
pattern would try to match the contents of the variable as an
atomic literal string (rather than as a subpattern). "Atomic"
means that the C<*> repetition quantifier applies to everything
that's in C<$hunk>, I<not> just to the last character
(as it does in Perl 5).
In other words, a raw variable in a Perl 6 pattern is matched
as if it were a Perl 5 regex in which the interpolation had been
C<quotemeta>'d and then placed in a pair of non-capturing parentheses.
That's really handy in something like:
# Perl 6
my $target = <>; # Get literal string to search for
$text =~ m/ $target* /; # Search for it as a literal
[Update: C<< <> >> is no longer the input operator. And C<=~> has been
replaced by C<~~>.]
which in Perl 5 we'd have to write as:
# Perl 5
my $target = <>; # Get literal string to search for
chomp $target; # No autochomping in Perl 5
$text =~ m/ (?:\Q$target\E)* /x; # Search for it, quoting metas
Raw arrays and hashes interpolate as literals too. For example, if we
use an array in a Perl 6 pattern, the matcher will attempt to match any
of its elements (each as a literal). So:
# Perl 6
@cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
$str =~ / @cmd \( .*? \) /; # Match a cmd, followed by stuff in parens
is the same as:
# Perl 5
@cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
$cmd = join '|', map { quotemeta $_ } @cmd;
$str =~ / (?:$cmd) \( .*? \) /x;
By the way, putting the array into angle brackets would cause the matcher to
try and match each of the array elements as a pattern, rather than as a literal.
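To make the "atomic literal" point concrete, here is a small Perl 5 sketch that shows what goes wrong without it. The sample strings are invented for the demonstration, and it uses C<+> rather than C<*> so that the match cannot succeed on an empty string.

```perl
use strict;
use warnings;

my $target = "a.c";   # contains a regex metacharacter

# Plain Perl 5 interpolation: "." is a wildcard and "*" binds to the final "c".
my ($plain)  = "abcccc a.ca.c" =~ / ($target*) /x;

# Quotemeta'd and grouped, which is how a raw Perl 6 variable behaves.
my ($atomic) = "abcccc a.ca.c" =~ / ((?:\Q$target\E)+) /x;

print "plain=[$plain] atomic=[$atomic]\n";
```

The first match wanders off into C<"abcccc">; the second finds the two literal copies of C<"a.c">.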
=head1 The incredible C<$hunk>
The rule that C<< <$hunk> >> tries to match against is the next one defined
in the program. Here's the annotated version of it:
$hunk = rx :i { # Case-insensitively...
[ # Start a non-capturing group
<$linenum> # Match the subrule in $linenum
a # Match a literal 'a'
:: # Commit to this alternative
<$linerange> # Match the subrule in $linerange
\n # Match a newline
<$appendline> # Match the subrule in $appendline...
+ # ...one-or-more times
| # Or...
<$linerange> d :: <$linenum> \n # Match $linerange, 'd', $linenum, newline
<$deleteline>+ # Then match $deleteline once-or-more
| # Or...
<$linerange> c :: <$linerange> \n # Match $linerange, 'c', $linerange, newline
<$deleteline>+ # Then match $deleteline once-or-more
--- \n # Then match three '-' and a newline
<$appendline>+ # Then match $appendline once-or-more
] # End of non-capturing group
| # Or...
( # Start a capturing group
\N* # Match zero-or-more non-newlines
) # End of capturing group
::: # Emphatically commit to this alternative
{ fail "Invalid diff hunk: $1" } # Then fail with an error msg
};
The first thing to note is that, like a Perl 5 C<qr>, a Perl 6 C<rx> can take
(almost) any delimiters we choose. The C<$hunk> pattern uses C<{...}>, but
we could have used:
rx/pattern/ # Standard
rx[pattern] # Alternative bracket-delimiter style
rx<pattern> # Alternative bracket-delimiter style
rxE<laquo>formeE<raquo> # D<eacute>limiteurs trE<egrave>s chic
rx>pattern< # Inverted bracketing is allowed too (!)
[Update: Inverted brackets are no longer allowed.]
rxE<raquo>MusterE<laquo> # AnfE<uuml>hrungszeichen in richtiger Reihenfolge
rx!pattern! # Excited
rx=pattern= # Unusual
rx?pattern? # No special meaning in Perl 6
rx#pattern# # Careful with these: they disable internal comments
[Update: C<#> is no longer allowed as the delimiter in any quote-like
construct. It is always a comment. This is not a hardship in Perl 6--you
have plenty of Unicode characters to choose from.]
=head1 Modified modifiers
In fact, the only characters not permitted as C<rx> delimiters are
C<':'> and C<'('>. That's because C<':'> is the character used to
introduce pattern modifiers in Perl 6, and C<'('> is the character used
to delimit any arguments that might be passed to those pattern modifiers.
[Update: C<:> and C<#> are never permitted, but C<(> is now permitted
if there is whitespace before it.]
In Perl 6 pattern modifiers are placed I<before> the pattern, rather
than after it. That makes life easier for the parser, since it doesn't
have to go back and reinterpret the contents of a rule when it reaches
the end and discovers a C</s> or C</m> or C</i> or C</x>. And it makes life
easier for anyone reading the code...for precisely the same reason.
The only modifier used in the C<$hunk> rule is the C<:i> (case-insensitivity)
modifier, which works exactly as it does in Perl 5.
The other rule modifiers available in Perl 6 are:
=over
=item C<:e> or C<:each>
This is the replacement for Perl 5's C</g> modifier. It causes a
match (or substitution) to be attempted as many times as possible.
The name was changed because "each" is shorter and clearer in intent
than "globally". And because the C<:each> modifier can be combined with
other modifiers (see below) in such a way that it's no longer "global"
in its effect.
[Update: Switched back to C<:g> now.]
=item C<:x($count)>
This modifier is like C<:e>, in that it causes the match or substitution
to be attempted repeatedly. However, unlike C<:e>, it specifies exactly
how many times the match must succeed. For example:
"fee fi " =~ m:x(3)/ (f\w+) /; # fails
"fee fi fo" =~ m:x(3)/ (f\w+) /; # succeeds (matches "fee","fi","fo")
"fee fi fo fum" =~ m:x(3)/ (f\w+) /; # succeeds (matches "fee","fi","fo")
Note that the repetition count doesn't have to be a constant:
m:x($repetitions)/ pattern /
There is also a series of tidy abbreviations for all the constant cases:
m:1x/ pattern / # same as: m:x(1)/ pattern /
m:2x/ pattern / # same as: m:x(2)/ pattern /
m:3x/ pattern / # same as: m:x(3)/ pattern /
# etc.
=item C<:nth($count)>
This modifier causes a match or substitution to be attempted repeatedly,
but to ignore the first C<$count-1> successful matches. For example:
my $foo = "fee fi fo fum";
$foo =~ m:nth(1)/ (f\w+) /; # succeeds (matches "fee")
$foo =~ m:nth(2)/ (f\w+) /; # succeeds (matches "fi")
$foo =~ m:nth(3)/ (f\w+) /; # succeeds (matches "fo")
$foo =~ m:nth(4)/ (f\w+) /; # succeeds (matches "fum")
$foo =~ m:nth(5)/ (f\w+) /; # fails
$foo =~ m:nth($n)/ (f\w+) /; # depends on the numeric value of $n
$foo =~ s:nth(3)/ (f\w+) /bar/; # $foo now contains: "fee fi bar fum"
Again there is also a series of abbreviations:
$foo =~ m:1st/ (f\w+) /; # succeeds (matches "fee")
$foo =~ m:2nd/ (f\w+) /; # succeeds (matches "fi")
$foo =~ m:3rd/ (f\w+) /; # succeeds (matches "fo")
$foo =~ m:4th/ (f\w+) /; # succeeds (matches "fum")
$foo =~ m:5th/ (f\w+) /; # fails
$foo =~ s:3rd/ (f\w+) /bar/; # $foo now contains: "fee fi bar fum"
By the way, Perl isn't going to be pedantic about these "ordinal" versions
of repetition specifiers. If you're not a native English speaker, and you
find C<:1th>, C<:2th>, C<:3th>, C<:4th>, etc. easier to remember, that's
perfectly okay.
The various types of repetition modifiers can also be combined
by separating them with additional colons:
my $foo = "fee fi fo feh far foo fum ";
$foo =~ m:2nd:2x/ (f\w+) /; # succeeds (matches "fi", "feh")
$foo =~ m:each:2nd/ (f\w+) /; # succeeds (matches "fi", "feh", "foo")
$foo =~ m:x(2):nth(3)/ (f\w+) /; # succeeds (matches "fo", "foo")
$foo =~ m:each:3rd/ (f\w+) /; # succeeds (matches "fo", "foo")
$foo =~ m:2x:4th/ (f\w+) /; # fails (not enough matches to satisfy :2x)
$foo =~ m:4th:each/ (f\w+) /; # succeeds (matches "feh")
$foo =~ s:each:2nd/ (f\w+) /bar/; # $foo now "fee bar fo bar far bar fum ";
Note that the order in which the two modifiers are specified doesn't matter.
=item C<:p5> or C<:perl5>
This modifier causes Perl 6 to interpret the contents of a rule as
a regular expression in Perl 5 syntax. This is mainly provided as a
transitional aid for porting Perl 5 code. And to mollify the
curmudgeonly.
[Update: Now C<:P5> or C<:Perl5> to avoid confusion with C<:p>.]
=item C<:w> or C<:word>
This modifier causes whitespace appearing in the pattern to match optional
whitespace in the string being matched. For example, instead of having
to cope with optional whitespace explicitly:
$cmd =~ m/ \s* <keyword> \s* \( [\s* <arg> \s* ,?]* \s* \)/;
we can just write:
$cmd =~ m:w/ <keyword> \( [ <arg> ,?]* \)/;
The C<:w> modifier is also smart enough to detect those cases where
the whitespace should actually be mandatory. For example:
$str =~ m:w/a symmetric ally/
is the same as:
$str =~ m/a \s+ symmetric \s+ ally/
rather than:
$str =~ m/a \s* symmetric \s* ally/
So it won't accidentally match strings like C<"asymmetric ally"> or
C<"asymmetrically">.
[Update: This is now the C<:s>/C<:sigspace> modifier, and is implicit
in C<rule> declarations.]
=item C<:any>
This modifier causes the rule to match a given string in every possible
way, simultaneously, and then return all the possible matches. For example:
my $str = "ahhh";
@matches = $str =~ m/ah*/; # returns "ahhh"
@matches = $str =~ m:any/ah*/; # returns "ahhh", "ahh", "ah", "a"
[Update: This split into two modifiers, C<:overlap> and C<:exhaustive>.]
=item C<:u0>, C<:u1>, C<:u2>, C<:u3>
These modifiers specify how the rule matches the dot (C<.>)
metacharacter against Unicode data. If C<:u0> is specified, dot matches
a single byte; if C<:u1> is specified, dot matches a single codepoint
(i.e. one or more bytes representing a single Unicode "character").
If C<:u2> is specified, dot matches a single grapheme (i.e. a base
codepoint followed by zero or more modifier codepoints, such as
accents). If C<:u3> is specified, dot matches an appropriate "something"
in a language-dependent manner.
It's okay to ignore this modifier if you're not using Unicode (and maybe even
if you are). As usual, Perl will try very hard to Do The Right Thing.
To that end, the default behaviour of rules is C<:u2>, unless an
overriding pragma (e.g. C<use bytes>) is in effect.
[Update: Now C<:bytes>, C<:codes>, C<:graphs>, and C<:langs>.]
=back
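None of the repetition modifiers above exists in Perl 5, but their effects can be imitated. The sketch below is one such imitation, assuming Perl 5.10 or later. It says nothing about how Perl 6 implements them, and every variable name in it is invented for the example.

```perl
use strict;
use warnings;

my $foo = "fee fi fo feh far foo fum ";

# In list context, /g returns every match; the modifiers become list operations.
my @all = $foo =~ / (f\w+) /gx;

# :nth(3): the third successful match (undef if there are fewer).
my $third = $all[2];

# :x(2): demand two matches, and keep only the first two.
my @first_two = @all >= 2 ? @all[0 .. 1] : ();

# :each:2nd: every second match.
my @every_2nd = @all[ grep { $_ % 2 == 1 } 0 .. $#all ];

# s:each:2nd: count matches inside the replacement and rewrite every second one.
my $n = 0;
(my $changed = $foo) =~ s/ (f\w+) / ++$n % 2 == 0 ? "bar" : $1 /gxe;

# :any: record each way of matching, then force backtracking with (*FAIL).
our @ways;
"ahhh" =~ / ah* (?{ push @ways, $& }) (*FAIL) /x;

print "$third | @first_two | @every_2nd | $changed| @ways\n";
```

That is a fair amount of bookkeeping for what Perl 6 expresses with a modifier or two.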
Note that the C</s>, C</m>, and C</e> modifiers are no longer available.
This is because they're no longer needed. The C</s> isn't needed because
the C<.> (dot) metacharacter now matches newlines as well. When we want
to match "anything except a newline", we now use the new C<\N> metatoken
(i.e. "opposite of C<\n>").
The C</m> modifier isn't required because C<^> and C<$> always mean start and
end of string respectively. To match the start and end of a line, we use
the new C<^^> and C<$$> metatokens instead.
The C</e> modifier is no longer needed because Perl 6 provides the
C<$(...)> string interpolator (as described in Apocalypse 2). So a
substitution such as:
# Perl 5
s/(\w+)/ get_val_for($1) /e;
becomes just:
# Perl 6
s/(\w+)/$( get_val_for($1) )/;
[Update: This is now written:
s/(\w+)/{ get_val_for($0) }/;
or
s/ \w+ /{ get_val_for($()) }/;
]
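For anyone keeping score in Perl 5, this sketch shows the behaviours that the retired C</s> and C</m> modifiers used to select. It assumes Perl 5.12 or later, where C<\N> is also available.

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

# Perl 5's dot stops at a newline unless /s is given; Perl 6's dot acts like /s.
my ($dot_plain) = $text =~ /(.+)/;     # "one"
my ($dot_s)     = $text =~ /(.+)/s;    # "one\ntwo\n"

# \N ("not a newline") ignores /s entirely.
my ($not_nl)    = $text =~ /(\N+)/s;   # "one"

# Perl 6's ^^ corresponds to Perl 5's ^ under /m; Perl 6's ^ to Perl 5's \A.
my @line_starts   = $text =~ /^(\w)/mg;   # ("o", "t")
my @string_starts = $text =~ /\A(\w)/g;   # ("o")

print scalar(@line_starts), " line starts, ", scalar(@string_starts), " string start\n";
```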
=head1 Take no prisoners
The first character of the C<$hunk> rule is an opening square bracket.
In Perl 5 that denoted the start of a character class, but not in Perl 6.
In Perl 6, square brackets mark the boundaries of a non-capturing
group. That is, a pair of square brackets in Perl 6 is the same as a
C<(?:...)> in Perl 5, but less line-noisy.
By the way, to get a character class in Perl 6, we need to put the
square brackets inside a pair of metasyntactic angle brackets.
So the Perl 5:
# Perl 5
/ [A-Za-z] [0-9]+ /x # An A-Z or a-z, followed by digits
would become in Perl 6:
# Perl 6
/ <[A-Za-z]> <[0-9]>+ / # An A-Z or a-z, followed by digits
The Perl 5 complemented character class:
# Perl 5
/ [^A-Za-z]+ /x # One-or-more chars-that-aren't-A-Z-or-a-z
becomes in Perl 6:
# Perl 6
/ <-[A-Za-z]>+ / # One-or-more chars-that-aren't-A-Z-or-a-z
The external minus sign is used (instead of an internal caret) because
Perl 6 allows proper set operations on character classes, and the minus sign is
the "difference" operator. So we could also create:
# Perl 6
/ < <alpha> - [A-Za-z] >+ / # All alphabetics except A-Z or a-z
# (i.e. the accented alphabetics)
[Update: Would now need to be C<< <+<alpha> - [A-Za-z]> >> to avoid ambiguity
with "Texas quotes", and because we want to reserve whitespace as the first
character inside the angles for other uses.]
Explicit character classes were deliberately made a little less
convenient in Perl 6, because they're generally a bad idea in a
Unicode world. For example, the C<[A-Za-z]> character class in the
above examples won't even match standard alphabetic Latin-1
characters like C<'E<Atilde>'>, C<'E<eacute>'>, C<'E<oslash>'>,
let alone alphabetic characters from code-sets such as Cyrillic,
Hiragana, Ogham, Cherokee, or Klingon.
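The same trap exists in Perl 5, as this sketch shows. It assumes that the Unicode property C<\p{Alphabetic}> is a reasonable stand-in for the Perl 6 C<< <alpha> >>, and it fakes the set difference with a negative lookahead.

```perl
use strict;
use warnings;

my $word = "caf\x{e9}";   # "cafe" with an acute accent on the final letter

my $ascii_class   = $word =~ / \A [A-Za-z]+ \z /x       ? 1 : 0;  # fails on the accent
my $unicode_alpha = $word =~ / \A \p{Alphabetic}+ \z /x ? 1 : 0;  # matches

# "Alphabetic minus A-Za-z", emulated with a negative lookahead.
my @accented = $word =~ / ( (?! [A-Za-z] ) \p{Alphabetic} ) /gx;

printf "%d %d %d\n", $ascii_class, $unicode_alpha, scalar @accented;
```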
=head1 Meanwhile, back at the C<$hunk>...
The non-capturing group of the C<$hunk> pattern groups together three
alternatives, separated by C<|> metacharacters (as in Perl 5).
The first alternative:
<$linenum> a :: <$linerange>
\n
<$appendline>+
grabs whatever is in the C<$linenum> variable, treats it as a
subpattern, and attempts to match against it. It then matches a
literal letter C<'a'> (or an C<'A'>, because of the C<:i> modifier on the rule).
Then whatever the contents of the C<$linerange> variable match. Then a
newline. Then it tries to match whatever the pattern in C<$appendline>
would match, one-or-more times.
But what about that double-colon after the C<a>? Shouldn't the
pattern have tried to match two colons at that point?
=head1 This or nothing
Actually, no. The double-colon is a new Perl 6 pattern control
structure. It has no effect (and is ignored) when the pattern is
successfully matching, but if the pattern match should fail, and consequently
back-track over the double-colon -- for example, to try and re-match an
earlier repetition one fewer times -- the double-colon causes the entire
surrounding group (i.e. the surrounding C<[...]> in this case) to fail as well.
That's a useful optimization in this case because, if we match a line
number followed by an C<'a'> but subsequently fail, then there's no
point even trying either of the other two alternatives in the same
group. Because we found an C<'a'>, there's no chance we could match a
C<'d'> or a C<'c'> instead.
So, in general, a double-colon means: "At this point I'm committed to this
alternative within the current group -- don't bother with the others
if this one fails after this point".
There are other control directives like this too. A single colon means: "Don't
bother backtracking into the previous element". That's useful in a pattern
like:
rx:w/ $keyword [-full|-quick|-keep]+ : end /
Suppose we successfully match the keyword (as a literal, by the way) and
one-or-more of the three options, but then fail to match C<'end'>.
In that case, there's no point backtracking and trying to match one
fewer options, and I<still> failing to find an C<'end'>. And then
backtracking I<another> option, and failing again, etc. By using
the colon after the repetition, we tell the matcher to give up after
the very first attempt.
However, the single colon isn't just a "Greed is Good" operator. It's
much more like a "Resistance is Futile" operator. That is, if the
preceding repetition had been non-greedy instead:
rx:w/ $keyword [-full|-quick|-keep]+? : end /
then backtracking over the colon would prevent the C<+?> from attempting
to match I<more> options. Note that this means that C<x+?:> is just a
baroque way of matching exactly one repetition of C<x>, since the
non-greedy repetition initially tries to match the minimal number of
times (i.e. once) and the trailing colon then prevents it from
backtracking and trying longer matches. Likewise, C<x*?:> and C<x??:>
are arcane ways of matching exactly zero repetitions of C<x>.
Generally, though, a single colon tells the pattern matcher that there's
no point trying any other match on the preceding repetition, because
retrying (whether more or fewer repetitions) would just waste time and
would still fail.
There's also a three-colon directive. Three colons means: "If we have
to backtrack past here, cause the entire rule to fail" (i.e. not just
this group). If the double-colon in C<$hunk> had been triple:
<$linenum> a ::: <$linerange>
\n
<$appendline>+
then matching a line number and an C<'a'> and subsequently failing would
cause the entire C<$hunk> rule to fail immediately (though the C<$file>
rule that invoked it might still match successfully in some other way).
So, in general, a triple-colon specifies: "At this point I'm committed to this
way of matching the current rule -- give up on the rule completely if the
matching process fails at this point".
Four colons...would just be silly. So, instead, there's a special named
directive: C<< <commit> >>. Backtracking through a C<< <commit> >>
causes the entire match to immediately fail. And if the current rule is
being matched as part of a larger rule, that larger rule will fail as
well. In other words, it's the "Blow up this Entire Planet and Possibly
One or Two Others We Noticed on our Way Out Here" operator.
If the double-colon in C<$hunk> had been a C<< <commit> >> instead:
<$linenum> a <commit> <$linerange>
\n
<$appendline>+
then matching a line number and an C<'a'> and subsequently
failing would cause the entire C<$hunk> rule to fail immediately, I<and>
would also cause the C<$file> rule that invoked it to fail immediately.
So, in general, a C<< <commit> >> means: "At this point I'm
committed to this way of completing the current match -- give up
all attempts at matching anything if the matching process fails at
this point".
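Perl 5.10 and later offer some rough analogues of these directives, and trying them out is a handy way to get a feel for the semantics. The sketch below treats possessive quantifiers and atomic groups as stand-ins for the single colon, and the C<(*COMMIT)> verb as a stand-in for C<< <commit> >>. Those pairings are an approximation chosen for illustration, not part of the Perl 6 design.

```perl
use strict;
use warnings;

# Single colon, roughly a possessive quantifier: \w++ keeps everything it
# matched, so the trailing "1" can never be found.
my $greedy     = "abc1" =~ / \A \w+  1 \z /x ? 1 : 0;   # backtracks, matches
my $possessive = "abc1" =~ / \A \w++ 1 \z /x ? 1 : 0;   # cannot backtrack, fails

# "x+?:" as an atomic group: exactly one x, with no retrying for more.
my $lazy        = "xxxy" =~ / \A x+? y /x       ? 1 : 0;  # grows to "xxx", matches
my $lazy_atomic = "xxxy" =~ / \A (?> x+? ) y /x ? 1 : 0;  # stuck at one "x", fails

# <commit>, roughly (*COMMIT): backtracking past it abandons the whole match,
# including any attempts at later starting positions.
my $free      = "xay" =~ / a z | y /x           ? 1 : 0;  # finds "y" at the end
my $committed = "xay" =~ / a (*COMMIT) z | y /x ? 1 : 0;  # gives up after "a"

print "$greedy$possessive $lazy$lazy_atomic $free$committed\n";
```

In each pair the first pattern succeeds and the second fails, purely because of where backtracking has been forbidden.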
=head1 Failing with style
The other two alternatives:
| <$linerange> d :: <$linenum> \n
<$deleteline>+
| <$linerange> c :: <$linerange> \n
<$deleteline>+ --- \n <$appendline>+
are just variants on the first.
If none of the three alternatives in the square brackets matches, then the
alternative outside the brackets is tried:
| (\N*) ::: { fail "Invalid diff hunk: $1" }
This captures a sequence of non-newline characters (C<\N> means "not
C<\n>", in the same way C<\S> means "not C<\s>" or C<\W> means "not
C<\w>"). Then it invokes a block of Perl code inside the pattern. The
call to C<fail> causes the match to fail at that point, and sets an
associated error message that would subsequently appear in the C<$!>
error variable (and which would also be accessible as part of C<$0>).
[Update: The old C<$0> variable has been renamed C<$/>.]
Note the use of the triple colon after the repetition. It's needed
because the C<fail> in the block will cause the pattern match to
backtrack, but there's no point backing up one character and trying
again, since the original failure was precisely what we wanted. The
presence of the triple-colon causes the entire rule to
fail as soon as the backtracking reaches that point the first time.
The overall effect of the C<$hunk> rule is therefore either to match one
hunk of the diff, or else fail with a relevant error message.
=head1 Home, home on the (line)range
The third and fourth rules:
$linerange = rx/ <$linenum> , <$linenum>
| <$linenum>
/;
$linenum = rx/ \d+ /;
specify that a line number consists of a series of digits, and that a line
range consists of either two line numbers with a comma between them or a single
line number. The C<$linerange> rule could also have been written:
$linerange = rx/ <$linenum> [ , <$linenum> ]? /;
which might be marginally more efficient, since it doesn't have to
backtrack and rematch the first C<$linenum> in the second alternative.
It's likely, however, that the rule optimizer will detect such cases and
automatically hoist the common prefix out anyway, so it's probably not
worth the decrease in readability to do that manually.
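As a sanity check that the two formulations really are interchangeable, here is a Perl 5 sketch that builds both and compares them over a handful of made-up samples. The variable names are invented for the example.

```perl
use strict;
use warnings;

my $linenum = qr/ \d+ /x;

# Two alternatives, longest first, as in the original rule.
my $by_alternation = qr/ $linenum , $linenum | $linenum /x;

# Common prefix factored out by hand.
my $by_option      = qr/ $linenum (?: , $linenum )? /x;

my @samples = ("12", "12,34", "12,", ",34", "x");

# Keep the samples on which both versions give the same verdict.
my @agree = grep {
    my $alt = /\A (?: $by_alternation ) \z/x ? 1 : 0;
    my $opt = /\A $by_option \z/x            ? 1 : 0;
    $alt == $opt;
} @samples;

# Keep the samples that are valid line ranges.
my @valid = grep { /\A $by_option \z/x } @samples;

print scalar(@agree), " of ", scalar(@samples), " agree; valid: @valid\n";
```

Note that the alternation only works because the longer alternative comes first.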
=head1 What's my line?
The final two rules specify the structure of individual context lines in the
diff (i.e. the lines that say what text is being added or removed by the hunk):
$deleteline = rx/^^ \< <sp> (\N* \n) /
$appendline = rx/^^ \> <sp> (\N* \n) /
The C<^^> markers ensure that each rule starts at the beginning
of an entire line.
The first character on that line must be either a C<< '<' >> or a
C<< '>' >>. Note that we have to escape these characters since angle
brackets are metacharacters in Perl 6. An alternative would be to use
the "literal string" metasyntax:
$deleteline = rx/^^ <'<'> <sp> (\N* \n) /
$appendline = rx/^^ <'>'> <sp> (\N* \n) /
That is, angle brackets with a single-quoted string inside them match
the string's sequence of characters as literals (including whitespace
and other metatokens).
Or we could have used the quotemeta metasyntax (C<\Q[...]>):
$deleteline = rx/^^ \Q[<] <sp> (\N* \n) /
$appendline = rx/^^ \Q[>] <sp> (\N* \n) /
Note that Perl 5's C<\Q...\E> construct is replaced in Perl 6 by
just the C<\Q> marker, which now takes a group after it.
[Update: There is no longer any \Q (or \L or \U for that matter).]
We could also have used a single-letter character class:
$deleteline = rx/^^ <[<]> <sp> (\N* \n) /
$appendline = rx/^^ <[>]> <sp> (\N* \n) /
or even a named character (C<\c[CHAR NAME HERE]>):
$deleteline = rx/^^ \c[LEFT ANGLE BRACKET] <sp> (\N* \n) /
$appendline = rx/^^ \c[RIGHT ANGLE BRACKET] <sp> (\N* \n) /
Whether any of those MTOWTDI is better than just escaping the angle bracket is,
of course, a matter of personal taste.
=head1 The final frontier
After the leading angle, a single literal space is expected. Again, we
could have specified that by escapology (C<\ >) or literalness
(C<< <' '> >>) or quotemetaphysics (C<\Q[ ]>) or character classification
(C<< <[ ]> >>), or deterministic nominalism (C<\c[SPACE]>), but Perl 6
also gives us a simple I<name> for the space character: C<< <sp> >>.
This is the preferred option, since it reduces line-noise and makes the
significant space much harder to miss.
Perl 6 provides predefined names for other useful subpatterns as
well, including:
=over
=item C<< <dot> >>
which matches a literal dot (C<'.'>) character (i.e. it's a more elegant
synonym for C<\.>);
=item C<< <lt> >> and C<< <gt> >>
which match a literal C<< '<' >> and C<< '>' >> respectively. These
give us yet another way of writing:
$deleteline = rx/^^ <lt> <sp> (\N* \n) /
$appendline = rx/^^ <gt> <sp> (\N* \n) /
=item C<< <ws> >>
which matches any sequence of whitespace (i.e. it's a more elegant
synonym for C<\s+>). Optional whitespace is, therefore, specified
as C<< <ws>? >> or C<< <ws>* >> (Perl 6 will accept either);
[Update: C<< <ws> >> is now the optional sigspace rule for the current
grammar.]
=item C<< <alpha> >>
which matches a single alphabetic character (i.e. it's like the
character class C<< <[A-Za-z]> >> but it handles accented characters
and alphabetic characters from non-Roman scripts as well);
=item C<< <ident> >>
which is a short-hand for C<< [ [<alpha>|_] \w* ] >> (i.e. a
standard identifier in many languages, including Perl)
=back
Using named subpatterns like these makes rules clearer in intent,
easier to read, and more self-documenting. And, as we'll see
L<shortly|"What's in a name?">, they're fully generalizable...we
can create our own.
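As a rough sketch of how a few of these named subpatterns behave, here is a small example in present-day Raku syntax (the implemented descendant of the design described here). It assumes only C<< <ident> >>, C<< <alpha> >> and C<< <ws> >>; the variable names are ours, chosen for illustration.

```raku
# <ident> matches a whole identifier; <alpha> matches one alphabetic character.
my $is-ident  = so 'line_count2' ~~ / ^ <ident> $ /;   # letters, digits, underscore
my $not-ident = so '2line'       ~~ / ^ <ident> $ /;   # may not start with a digit
my $accented  = so 'é'           ~~ / ^ <alpha> $ /;   # not limited to A-Z

# <ws> soaks up the run of spaces between two words.
my $m = 'old   new' ~~ / ^ (\w+) <ws> (\w+) $ /;

say "ident: $is-ident, bad ident: $not-ident, accented alpha: $accented";
say "words: $m[0] and $m[1]";
```

The named forms read much like ordinary words, which is the whole point: the pattern documents itself.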
=head1 Match-maker, match-maker...
Finally, we're ready to actually read in and match a diff file. In Perl 5, we'd
do that like so:
# Perl 5
local $/; # Disable input record separator (enable slurp mode)
my $text = <>; # Slurp up input stream into $text
print "Valid diff"
if $text =~ /$file/;
We could do the same thing in Perl 6 (though the syntax would differ
slightly) and in this case that would be fine. But, in general, it's
clunky to have to slurp up the entire input before we start matching.
The input might be huge, and we might fail early. Or we
might want to match input interactively (and issue an error message as
soon as the input fails to match). Or we might be matching a series of
different formats. Or we might want to be able to leave the input stream
in its original state if the match fails.
The inability to do pattern matches immediately on an input stream is one of
Perl 5's few weaknesses when it comes to text processing. Sure, we can read
line-by-line and apply pattern matching to each line, but trying to
match a construct that may be laid out across an unknown number of lines
is just painful.
Not in Perl 6 though. In Perl 6, we can bind an input stream to a scalar
variable (i.e. like a Perl 5 tied variable) and then just match on the
characters in that stream as if they were already in memory:
my $text is from($*ARGS); # Bind scalar to input stream
print "Valid diff"
if $text =~ /<$file>/; # Match against input stream
The important point is that, after the match, only those characters that
the pattern actually matched will have been removed from the input stream.
It may also be possible to skip the variable entirely and just write:
print "Valid diff"
if $*ARGS =~ /<$file>/; # Match against input stream
or:
print "Valid diff"
if <> =~ /<$file>/; # Match against input stream
but that's yet to be decided.
=head1 A cleaner approach
The previous example solves the problem of recognizing a valid diff file quite
nicely (and with only six rules!), but it does so by cluttering
up the program with a series of variables storing those precompiled patterns.
It's as if we were to write a collection of subroutines like this:
my $print_name = sub ($data) { print $data{name}, "\n"; };
my $print_age = sub ($data) { print $data{age}, "\n"; };
my $print_addr = sub ($data) { print $data{addr}, "\n"; };
my $print_info = sub ($data) {
$print_name($data);
$print_age($data);
$print_addr($data);
};
# and later...
$print_info($info);
You I<could> do it that way, but it's not the right way to do it.
The right way to do it is as a collection of named subroutines or methods,
often collected together in the namespace of a class or module:
module Info {
sub print_name ($data) { print $data{name}, "\n"; }
sub print_age ($data) { print $data{age}, "\n"; }
sub print_addr ($data) { print $data{addr}, "\n"; }
sub print_info ($data) {
print_name($data);
print_age($data);
print_addr($data);
}
}
Info::print_info($info);
So it is with Perl 6 patterns. You I<can> write them as a series of
pattern objects created at run-time, but they're much better specified as
a collection of named patterns, collected together at compile-time
in the namespace of a grammar.
Here's the previous diff-parsing example rewritten that way
(and with a few extra bells-and-whistles added in):
grammar Diff {
rule file { ^ <hunk>* $ }
rule hunk :i {
[ <linenum> a :: <linerange> \n
<appendline>+
|
<linerange> d :: <linenum> \n
<deleteline>+
|
<linerange> c :: <linerange> \n
<deleteline>+
--- \n
<appendline>+
]
|
<badline("Invalid diff hunk")>
}
rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
rule linerange { <linenum> , <linenum>
| <linenum>
}
rule linenum { \d+ }
rule deleteline { ^^ <out_marker> (\N* \n) }
rule appendline { ^^ <in_marker> (\N* \n) }
rule out_marker { \< <sp> }
rule in_marker { \> <sp> }
}
# and later...
my $text is from($*ARGS);
print "Valid diff"
if $text =~ /<Diff.file>/;
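For readers who want something to run, the following is a cut-down sketch of the same grammar in present-day Raku. It is an approximation under several assumptions of ours, not a transcription: it uses C<token> and the sequential alternation C<||>, names the entry rule C<TOP> so that C<.parse> can find it, matches a string already in memory, and leaves out C<badline>, the C<::> backtracking controls and the C<:i> modifier.

```raku
grammar Diff {
    token TOP        { ^ <hunk>* $ }
    token hunk       {
        || <linenum>   'a' <linerange> \n <appendline>+
        || <linerange> 'd' <linenum>   \n <deleteline>+
        || <linerange> 'c' <linerange> \n <deleteline>+ '---' \n <appendline>+
    }
    token linerange  { <linenum> [ ',' <linenum> ]? }
    token linenum    { \d+ }
    token deleteline { ^^ <out_marker> \N* \n }
    token appendline { ^^ <in_marker>  \N* \n }
    token out_marker { '< ' }
    token in_marker  { '> ' }
}

# Illustrative inputs: one well-formed classic diff, one with a bogus data line.
my $good = "2a3\n> added line\n5,6d4\n< gone\n< also gone\n8c8\n< old\n---\n> new\n";
my $bad  = "2a3\n+ not a classic diff line\n";

say 'Valid diff' if Diff.parse($good);
say 'Rejected'   unless Diff.parse($bad);
```

A failed parse simply yields a false value here; reporting which line was at fault is the job the C<badline> rule does in the full version.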
=head1 What's in a name?
The C<grammar> declaration creates a new namespace for rules
(in the same way a C<class> or C<module> declaration creates
a new namespace for methods or subroutines). If a block is
specified after the grammar's name:
grammar HTML {
rule file :iw { \Q[<HTML>] <head> <body> \Q[</HTML>] }
rule head :iw { \Q[<HEAD>] <head_tag>+ \Q[</HEAD>] }
# etc.
} # Explicit end of HTML grammar
then that new namespace is confined to that block. Otherwise the
namespace continues until the end of the source section of the
current file:
grammar HTML;
rule file :iw { \Q[<HTML>] <head> <body> \Q[</HTML>] }
rule head :iw { \Q[<HEAD>] <head_tag>+ \Q[</HEAD>] }
# etc.
# Implicit end of HTML grammar
__END__
Note that, as with the blockless variants on C<class> and C<module>,
this form of the syntax is designed to simplify one-namespace-per-file
situations. It's a compile-time error to put two or more blockless
grammars, classes or modules in a single file.
Within the namespace, named rules are defined using the C<rule>
declarator. It's analogous to the C<sub> declarator within a
module, or the C<method> declarator within a class. Just like
a class method, a named rule has to be invoked through its grammar
if we refer to it outside its own namespace. That's why the actual
match became:
$text =~ /<Diff.file>/; # Invoke through grammar
If we want to match a named rule, we put the name in angle brackets.
Indeed, many of the constructs we've already seen -- C<< <sp> >>,
C<< <ws> >>, C<< <ident> >>, C<< <alpha> >>, C<< <commit> >> --
are really just predefined named rules that come standard with Perl 6.
Like subroutines and methods, within their own namespace, rules don't
have to be qualified. Which is why we can write things like:
rule linerange { <linenum> , <linenum>
| <linenum>
}
instead of:
rule linerange { <Diff.linenum> , <Diff.linenum>
| <Diff.linenum>
}
Using named rules has several significant advantages, apart from making
the patterns look cleaner. For one thing, the compiler may be able to
optimize the embedded named rules better. For example, it could inline
the attempts to match C<< <linenum> >> within the C<linerange> rule. In
the C<rx> version:
$linerange = rx{ <$linenum> , <$linenum>
| <$linenum>
};
that's not possible, since the pattern matching mechanism won't know
what's in C<$linenum> until it actually tries to perform the match.
By the way, we I<can> still use interpolated C<< <$subrule> >>-ish
subpatterns in a named rule, and we can use named subpatterns
in an C<rx>-ish rule. The difference between C<rule> and C<rx>
is just that a C<rule> can have a name and must use C<{...}> as its
delimiters, whereas an C<rx> doesn't have a name and can use
any allowed delimiters.
[Update: Wherever this document uses the C<rule> keyword, that is now
the C<regex> keyword. There are keywords for C<token> and C<rule> but
they do something different now.]
=head1 Bad line! No match!
This version of the diff parser has an additional rule, named C<badline>.
This rule illustrates another similarity between rules and subroutines/methods:
rules can take arguments. The C<badline> rule factors out the error message
creation at the end of the C<hunk> rule. Previously that rule ended with:
| (\N*) ::: { fail "Invalid diff hunk: $1" }
but in this version it ends with:
| <badline("Invalid diff hunk")>
That's a much better abstraction of the error condition. It's easier to
understand and easier to maintain, but it does require us to be able to pass
an argument (the error message) to the new C<badline> subrule. To do
that, we simply declare it to have a parameter list:
rule badline($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
Note the strong syntactic parallel with a subroutine definition:
sub subname($param) { ... }
The argument is passed to a subrule by placing it in parentheses
after the rule name within the angle brackets:
| <badline("Invalid diff hunk")>
The argument can also be passed without the parentheses, but then
it is interpreted as if it were the body of a separate rule:
rule list_of ($pattern) {
<$pattern> [ , <$pattern> ]*
}
# and later...
$str =~ m:w/ \[ # Literal opening square bracket
<list_of \w\d+> # Call list_of subrule passing rule rx/\w\d+/
\] # Literal closing square bracket
/;
A rule can take as many arguments as it needs to:
rule seplist($elem, $sep) {
<$elem> [ <$sep> <$elem> ]*
}
and those arguments can also be passed by name, using the standard
Perl 6 pair-based mechanism (as described in Apocalypse 3).
$str =~ m:w/
\[ # literal left square bracket
<seplist(sep=>":", elem=>rx/<ident>/)> # colon-separated list of identifiers
\] # literal right square bracket
/;
Note that the list's element specifier is itself an anonymous rule,
which the C<seplist> rule will subsequently interpolate as a pattern (because
the C<$elem> parameter appears in angle brackets within C<seplist>).
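Parameterized rules can be tried out in present-day Raku as well. The sketch below is a simplified cousin of C<seplist>, under our own assumptions: only the separator is a parameter (the element is fixed as C<< <ident> >>), the separator is passed positionally as a plain string, and the grammar name C<IdentList> is invented for the example.

```raku
grammar IdentList {
    token TOP           { '[' <seplist(':')> ']' }     # pass ':' as the separator
    token seplist($sep) { <ident> [ $sep <ident> ]* }  # $sep is matched literally
}

my $m     = IdentList.parse('[alpha:beta:gamma]');
my @names = $m<seplist><ident>.map(*.Str);   # one entry per matched identifier

say @names.join(' ');
```

Because C<< <ident> >> appears more than once inside C<seplist>, its matches are collected into a list rather than overwriting one another.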
=head1 Thinking ahead
The only other change in the grammar version of the diff parser is that
the matching of the C<< '<' >> and C<< '>' >> at the start of the context
lines has been factored out. Whereas before we had:
$deleteline = rx/^^ \< <sp> (\N* \n) /
$appendline = rx/^^ \> <sp> (\N* \n) /
now we have:
rule deleteline { ^^ <out_marker> (\N* \n) }
rule appendline { ^^ <in_marker> (\N* \n) }
rule out_marker { \< <sp> }
rule in_marker { \> <sp> }
That seems like a step backwards, since it complicated the grammar for
no obvious benefit, but the benefit will be reaped L<later|"Different diffs">
when we discover another type of diff file that uses different
markers for incoming and outgoing lines.
=head1 What you match is what you get
Both the variable-based and grammatical versions of the code above do a
great job of I<recognizing> a diff, but that's all they do. If we only
want syntax checking, that's fine. But, generally, if we're parsing data
what we really want is to do something useful with it: transform it into
some other syntax, make changes to its contents, or perhaps convert it
to a Perl internal data structure for our program to manipulate.
Suppose we did want to build a hierarchical Perl data structure
representing the diff that the above examples match. What extra
code would we need?
None.
That's right. Whenever Perl 6 matches a pattern, it I<automatically>
builds a "result object" representing the various components of the
match.
That result object is named C<$0> (the program's name is now
C<$*PROG>) and it's lexical to the scope in which the match occurs. The
result object stores (amongst other things) the complete string matched
by the pattern, and it evaluates to that string when used in a string
context. For example:
if ($text =~ /<Diff.file>/) {
$difftext = $0;
}
That's handy, but not really useful for extracting data structures.
However, in addition, any components within a match that were captured
using parentheses become elements of the object's array attribute, and are
accessible through its array index operator. So,
for example, when a pattern such as:
rule linenum_plus_comma { (\d+) (,?) };
matches successfully, the array element 1 of the result object
(i.e. C<$0[1]>) is assigned the result of the first parenthesized
capture (i.e. the digits), whilst the array element 2
(C<$0[2]>) receives the comma. Note that array element
zero of any result object is assigned the complete string that the pattern
matched.
[Update: This has all changed. C<$/[0]> is the first result, same as C<$0>.
The entire result is extracted from a result object by casting to scalar,
with C<$()> returning the scalar value of C<$($/)>.]
There are also abbreviations for each of the array elements of C<$0>.
C<$0[1]> can also be referred to as...surprise, surprise...C<$1>,
C<$0[2]> can also be referred to as C<$2>, C<$0[3]> as C<$3>, etc.
Like C<$0>, each of these numeric variables is also lexical to the scope
in which the pattern match occurred.
The parts of a matched string that were matched by a named
subrule become entries in the result object's hash attribute, and are
subsequently accessible through its hash lookup operator.
So, for example, when the pattern:
rule deleteline { ^^ <out_marker> (\N* \n) }
matches, the result object's hash entry for the key C<'out_marker'> (i.e.
C<$0{out_marker}>) will contain the result object returned by the successful
nested match of the C<out_marker> subrule.
[Update: Curly subscripts no longer autoquote, so that's now
C<< $/<out_marker> >>, or C<< $<out_marker> >> for short.]
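To make the shape of the result object concrete, here is a small sketch in present-day Raku, where (as the Update notes say) positional captures are numbered from zero and named captures are reached with angle-bracket subscripts. The lexical C<out_marker> regex and the sample line are assumptions made for the example.

```raku
my regex out_marker { '< ' }

# One named subrule and one parenthesized capture.
my $m = "< old text\n" ~~ / ^^ <out_marker> (\N*) \n /;

say $m.Str.chomp;       # the whole matched string, minus its trailing newline
say ~$m[0];             # the first parenthesized capture
say ~$m<out_marker>;    # the nested result of the named subrule
```

The same object answers as a string, as an array and as a hash, depending on how we ask.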
=head1 A hypothetical solution to a very real problem
Named capturing into a hash is very convenient, but it doesn't work so well
for a rule like:
rule linerange {
<linenum> , <linenum>
| <linenum>
}
The problem is that the hash attribute of the rule's
C<$0> can only store one entry with the key C<'linenum'>.
So if the C<< <linenum> , <linenum> >> alternative matches,
then the result object from the second match of
C<< <linenum> >> will overwrite the entry for the first
C<< <linenum> >> match.
The solution to this is a new Perl 6 pattern matching feature known
as "hypothetical variables". A hypothetical variable is
a variable that is declared and bound within a pattern match
(i.e. inside a closure within a rule).
The variable is declared, not with a C<my>, C<our>, or C<temp>,
but with the new keyword C<let>, which was chosen because it's what
mathematicians and other philosophers use to indicate a hypothetical
assumption.
Once declared, a hypothetical variable is then bound using the
normal binding operator. For example:
rule checked_integer {
(\d+) # Match and capture one-or-more digits
{ let $digits := $1 } # Bind to hypothetical var $digits
- # Match a hyphen
(\d) # Match and capture one digit
{ let $check := $2 } # Bind to hypothetical var $check
}
In this example, if a sequence of digits is found, then the C<$digits>
variable is bound to that substring. Then, if the dash and check-digit
are matched, the digit is bound to C<$check>. However, if the dash or
digit is not matched, the match will fail and backtrack through the
closure. This backtracking causes the C<$digits> hypothetical variable
to be automatically I<un-bound>. Thus, if a rule fails to match,
the hypothetical variables within it are not associated with any value.
Each hypothetical variable is really just another name for the
corresponding entry in the result object's hash attribute. So binding
a hypothetical variable like C<$digits> within a rule actually sets the
C<$0{digits}> element of the rule's result object.
So, for example, to distinguish the two line numbers within a
line range:
rule linerange {
<linenum> , <linenum>
| <linenum>
}
we could bind them to two separate hypothetical variables -- say,
C<$from> and C<$to> -- like so:
rule linerange {
(<linenum>) # Match linenum and capture result as $1
{ let $from := $1 } # Save result as hypothetical variable
, # Match comma
(<linenum>) # Match linenum and capture result as $2
{ let $to := $2 } # Save result as hypothetical variable
|
(<linenum>) # Match linenum and capture result as $3
{ let $from := $3 } # Save result as hypothetical variable
}
Now our result object has a hash entry C<$0{from}> and (maybe) one for
C<$0{to}> (if the first alternative was the one that matched). In fact,
we could I<ensure> that the result always has a C<$0{to}>, by
setting the corresponding hypothetical variable in the second
alternative as well:
rule linerange {
(<linenum>)
{ let $from := $1 }
,
(<linenum>)
{ let $to := $2 }
|
(<linenum>)
{ let $from := $3; let $to := $from }
}
Problem solved.
But only by introducing a new problem. All that hypothesizing made our
rule ugly and complex. So Perl 6 provides a much prettier short-hand:
rule linerange {
$from := <linenum> # Match linenum rule, bind result to $from
, # Match comma
$to := <linenum> # Match linenum rule, bind result to $to
| # Or...
$from := $to := <linenum> # Match linenum rule,
} # bind result to both $from and $to
or, more compactly:
rule linerange {
$from:=<linenum> , $to:=<linenum>
| $from:=$to:=<linenum>
}
If a Perl 6 rule contains a variable that is immediately followed by
the binding operator (C<:=>), that variable is never interpolated.
Instead, it is treated as a hypothetical variable, and bound to the
result of the next component of the rule (in the above examples,
to the result of the C<< <linenum> >> subrule match).
[Update: Any variable can be "hypotheticalized" via C<let>, so we
no longer call these hypotheticals. You bind to C<< $<from> >> to
capture a named field now, and may bind to C<$from> only if that is
an externally predeclared variable.]
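In present-day Raku the same effect can be sketched with the alias form C<< <from=linenum> >>, which stores the subrule's result under C<< $<from> >>. The example below is an assumption-laden miniature: the grammar name C<LineRange> is ours, and rather than binding two names to one match it falls back to C<from> when no C<to> was captured.

```raku
grammar LineRange {
    token TOP     { <from=linenum> ',' <to=linenum> || <from=linenum> }
    token linenum { \d+ }
}

my @ranges;
for '7,9', '12' -> $text {
    my $m    = LineRange.parse($text);
    my $from = +$m<from>;
    my $to   = +($m<to> // $m<from>);   # a single line is a range of one
    @ranges.push: "$from to $to";
}

say @ranges.join("\n");
```

Each alternative names its pieces for itself, so nothing is overwritten when both line numbers are present.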
You can also use hypothetical arrays and hashes, binding them to
a component that captures repeatedly. For example, we might choose
to name our set of hunks:
rule file { ^ @adonises := <hunk>* $ }
collecting all the C<< <hunk> >> matches into a single array
(which would then be available after the match as C<$0{'@adonises'}>. Note that
the sigil is included in the key in this case).
[Update: Now you bind to @<adonises>, and the sigil is not included.]
Or we might choose to bind a hypothetical hash:
rule config {
%init := # Hypothetically, bind %init to...
[ # Start of group
(<ident>) # Match and capture an identifier
\h*=\h* # Match an equals sign with optional horizontal whitespace
(\N*) # Match and capture the rest of the line
\n # Match the newline
]*
}
where each repetition of the C<[...]*> grouping captures
two substrings and converts them to a key/value pair,
which is then added to the hash. The first captured substring in each
repetition becomes the key, and the second captured substring becomes
its associated value. The hypothetical C<%init> hash is also available
through the rule's result object, as C<$0{'%init'}> (again, with the
sigil as part of the key).
[Update: The current rules for all these bindings are detailed in S05.]
=head1 The nesting instinct
Of course, those line number submatches in:
rule linerange {
$from:=<linenum> , $to:=<linenum>
| $from:=$to:=<linenum>
}
will have returned their own result objects. And it's a reference to those
nested result objects that actually gets stored in C<linerange>'s
C<$0{from}> and C<$0{to}>.
Likewise, in the next higher rule:
rule hunk :i {
[ <linenum> a :: <linerange> \n
<appendline>+
|
<linerange> d :: <linenum> \n
<deleteline>+
|
<linerange> c :: <linerange> \n
<deleteline>+
--- \n
<appendline>+
]
};
the match on C<< <linerange> >> will return I<its> C<$0> object.
So, within the C<hunk> rule, we could access the "from" digits
of the line range of the hunk as: C<$0{linerange}{from}>.
Likewise, at the highest level:
rule file { ^ <hunk>* $ }
we are matching a series of hunks, so the hypothetical C<$hunk>
variable (and hence C<$0{hunk}>) will contain a result object whose array
attribute contains the series of result objects returned by each
individual C<< <hunk> >> match.
So, for example, we could access the "from" digits of the line range of
the third hunk as: C<$0{hunk}[2]{linerange}{from}>.
[Update: C<< $<hunk>[2]<linerange><from> >> now.]
=head1 Extracting the insertions
More usefully, we could locate and print every line in the diff that was
being inserted, regardless of whether it was inserted by an "append" or
a "change" hunk. Like so:
my $text is from($*ARGS);
if $text =~ /<Diff.file>/ {
for @{ $0{file}{hunk} } -> $hunk {
print @{$hunk{appendline}}
if $hunk{appendline};
}
}
[Update: The C<@{...}> construct is now C<@(...)>. A C<.each> would
probably be more readable though.]
Here, the C<if> statement attempts to match the text against the pattern
for a diff file. If it succeeds, the C<for> loop grabs the C<< <hunk>* >>
result object, treats it as an array, and then iterates each hunk match
object in turn into C<$hunk>. The array of append lines for each hunk
match is then printed (if there is in fact a reference to that array in
the hunk).
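The same walk over the result object can be sketched in present-day Raku. The grammar here is a reduced stand-in of our own (entry rule C<TOP>, C<token> and C<||> throughout, no C<badline>), and it parses an in-memory string rather than an input stream.

```raku
grammar Diff {
    token TOP        { ^ <hunk>* $ }
    token hunk       {
        || <linenum>   'a' <linerange> \n <appendline>+
        || <linerange> 'd' <linenum>   \n <deleteline>+
        || <linerange> 'c' <linerange> \n <deleteline>+ '---' \n <appendline>+
    }
    token linerange  { <linenum> [ ',' <linenum> ]? }
    token linenum    { \d+ }
    token deleteline { ^^ '< ' \N* \n }
    token appendline { ^^ '> ' \N* \n }
}

my $diff = "2a3\n> added line\n5,6d4\n< gone\n< also gone\n8c8\n< old\n---\n> new\n";

my @inserted;
if Diff.parse($diff) -> $match {
    for $match<hunk>.list -> $hunk {
        # Delete-only hunks have no appendline entries, so skip them.
        @inserted.append($hunk<appendline>.map(*.Str)) if $hunk<appendline>;
    }
}

print @inserted.join;
```

Only the lines from the "append" and "change" hunks are collected; the "delete" hunk contributes nothing.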
=head1 Don't just match there; do something!
Because Perl 6 patterns can have arbitrary code blocks inside them, it's
easy to have a pattern actually perform syntax transformations whilst
it's parsing. That's often a useful technique because it allows
us to manipulate the various parts of a hierarchical representation
locally (within the rules that recognize them).
For example, suppose we wanted to "reverse" the diff file. That is,
suppose we had a diff that specified the changes required to
transform file A to file B, but we needed the back-transformation
instead: from file B to file A. That's relatively easy to create.
We just turn every "append" into a "delete", every "delete" into
an "append", and reverse every "change".
The following code does exactly that:
grammar ReverseDiff {
rule file { ^ <hunk>* $ }
rule hunk :i {
[ <linenum> a :: <linerange> \n
<appendline>+
{ @$appendline =~ s/<in_marker>/< /;
let $0 := "${linerange}d${linenum}\n"
_ join "", @$appendline;
}
|
<linerange> d :: <linenum> \n
<deleteline>+
{ @$deleteline =~ s/<out_marker>/> /;
let $0 := "${linenum}a${linerange}\n"
_ join "", @$deleteline;
}
|
$from:=<linerange> c :: $to:=<linerange> \n
<deleteline>+
--- \n
<appendline>+
{ @$appendline =~ s/<in_marker>/< /;
[Update: Smartmatch is now C<~~>.]
@$deleteline =~ s/<out_marker>/> /;
let $0 := "${to}c${from}\n"
[Update: That C<${to}> syntax is very, very dead now. Use {$to} instead.]
_ join("", @$appendline)
[Update: String concatenation is now C<~>. And just say C<$appendline.join>
to join with null strings.]
_ "---\n"
_ join("", @$deleteline);
}
]
|
<badline("Invalid diff hunk")>
}
rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
rule linerange { $from:=<linenum> , $to:=<linenum>
| $from:=$to:=<linenum>
}
rule linenum { (\d+) }
rule deleteline { ^^ <out_marker> (\N* \n) }
rule appendline { ^^ <in_marker> (\N* \n) }
rule out_marker { \< <sp> }
rule in_marker { \> <sp> }
}
# and later...
my $text is from($*ARGS);
print @{ $0{file}{hunk} }
if $text =~ /<ReverseDiff.file>/;
The rule definitions for C<file>, C<badline>, C<linerange>, C<linenum>,
C<appendline>, C<deleteline>, C<in_marker> and C<out_marker> are exactly
the same as before.
All the work of reversing the diff is performed in the C<hunk> rule.
To do that work, we have to extend each of the three main alternatives
of that rule, adding to each a closure that changes the result object
it returns.
[Update: Now you can just return the new scalar result object with a
C<return> from the closure.]
=head1 Smarter alternatives
In the first alternative (which matches "append" hunks), we match as
before:
<linenum> a :: <linerange> \n
<appendline>+
But then we execute an embedded closure:
{ @$appendline =~ s/<in_marker>/< /;
let $0 := "${linerange}d${linenum}\n"
_ join "", @$appendline;
}
The first line reverses the "marker" arrows on each line of data that
was previously being appended, using the smart-match operator to
apply the transformation to each line. Note too, that we reuse the
C<in_marker> rule within the substitution.
Then we bind the result object (i.e. the hypothetical variable C<$0>) to
a string representing the "reversed" append hunk. That is, we reverse the
order of the line range and line number components, put a C<'d'> (for "delete")
between them, and then follow that with all the reversed data:
let $0 := "${linerange}d${linenum}\n"
_ join "", @$appendline;
The changes to the "delete" alternative are exactly symmetrical.
Capture the components as before, reverse the marker arrows,
reverse the C<$linerange> and C<$linenum>, change the C<'d'> to an C<'a'>,
and append the reversed data lines.
In the third alternative:
$from:=<linerange> c :: $to:=<linerange> \n
<deleteline>+
--- \n
<appendline>+
{ @$appendline =~ s/<in_marker>/< /;
@$deleteline =~ s/<out_marker>/> /;
let $0 := "${to}c${from}\n"
_ join("", @$appendline)
_ "---\n"
_ join("", @$deleteline);
}
there are line ranges on both sides of the C<'c'>. So we
need to give them distinct names, by binding them to extra hypothetical
variables: C<$from> and C<$to>. We then reverse the order of the two line
ranges, but leave the C<'c'> as it was (because we're simply changing
something back to how it was previously). The markers on both the append
and delete lines are reversed, and then the order of the two sets of
lines is also reversed.
Once those transformations have been performed on each hunk (i.e. as it's
being matched!), the result of successfully matching any C<< <hunk> >>
subrule will be a string in which the matched hunk has already been
reversed.
All that remains is to match the text against the grammar, and
print out the (modified) hunks:
print @{ $0{file}{hunk} }
if $text =~ /<ReverseDiff.file>/;
And, since the C<file> rule is now in the ReverseDiff grammar's namespace,
we need to call the rule through that grammar. Note the way the
syntax for doing that continues the parallel with methods and classes.
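In present-day Raku, the usual way for a rule to hand back a replacement result is C<make>, with C<.made> retrieving it afterwards. The following is a sketch of the reversing grammar along those lines. It is our own approximation, not the code above: it builds the reversed data lines in C<deleteline> and C<appendline> rather than by substitution, uses the alias form for C<from> and C<to>, and omits C<badline>.

```raku
grammar ReverseDiff {
    token TOP  { ^ <hunk>* $ { make $<hunk>.map(*.made).join } }
    token hunk {
        || <linenum> 'a' <linerange> \n <appendline>+
           { make "{$<linerange>}d{$<linenum>}\n" ~ $<appendline>.map(*.made).join }
        || <linerange> 'd' <linenum> \n <deleteline>+
           { make "{$<linenum>}a{$<linerange>}\n" ~ $<deleteline>.map(*.made).join }
        || <from=linerange> 'c' <to=linerange> \n <deleteline>+ '---' \n <appendline>+
           { make "{$<to>}c{$<from>}\n"
                ~ $<appendline>.map(*.made).join
                ~ "---\n"
                ~ $<deleteline>.map(*.made).join }
    }
    token linerange  { <linenum> [ ',' <linenum> ]? }
    token linenum    { \d+ }
    # Each data line is rebuilt with the opposite marker.
    token deleteline { ^^ '< ' $<text>=[ \N* \n ] { make '> ' ~ $<text> } }
    token appendline { ^^ '> ' $<text>=[ \N* \n ] { make '< ' ~ $<text> } }
}

my $diff     = "2a3\n> added line\n5,6d4\n< gone\n< also gone\n8c9\n< old\n---\n> new\n";
my $reversed = ReverseDiff.parse($diff).made;

print $reversed;
```

As in the original, every hunk is reversed at the moment it is matched, so the top-level rule only has to join the pieces together.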
=head1 Rearranging the deck-chairs...
It might have come as a surprise that we were allowed to bind
the pattern's C<$0> result object directly, but there's nothing magical
about it. C<$0> turns out to be just another hypothetical variable...the
one that happens to be returned when the match is complete.
[Update: For C<$0> read C<$/> here.]
Likewise, C<$1>, C<$2>, C<$3>, etc. are all hypotheticals, and can also be
explicitly bound in a rule. That's very handy for ensuring that the
right substring always turns up in the right numbered variable. For
example, consider a Perl 6 rule to match simple Perl 5 method calls
(matching I<all> Perl 5 method calls would, of course, require a much
more sophisticated rule):
rule method_call :w {
# Match direct syntax: $var->meth(...)
\$ (<ident>) -\> (<ident>) \( (<arglist>) \)
| # Match indirect syntax: meth $var (...)
(<ident>) \$ (<ident>) [ \( (<arglist>) \) | (<arglist>) ]
}
my ($varname, $methodname, $arglist);
if ($source_code =~ / $0 := <method_call> /) {
$varname = $1 // $5;
$methodname = $2 // $4;
$arglist = $3 // $6 // $7;
}
By binding the match's C<$0> to the result of the C<< <method_call> >>
subrule, we bind its C<$0[1]>, C<$0[2]>, C<$0[3]>, etc. to those array
elements in C<< <method_call> >>'s result object. And thereby bind
C<$1>, C<$2>, C<$3>, etc. as well. Then it's just a matter of sorting
out which numeric variable ended up with which bit of the method call.
That's okay, but it would be much better if we could guarantee that
the variable name was always in C<$1>, the method name in C<$2>,
and the argument list in C<$3>. Then we could replace the last six lines
above with just:
my ($varname, $methodname, $arglist) =
$source_code =~ / $0 := <method_call> /;
In Perl 5 there was no way to do that, but in Perl 6 it's relatively easy.
We just modify the C<method_call> rule like so:
rule method_call :w {
\$ $1:=<ident> -\> $2:=<ident> \( $3:=<arglist> \)
| $2:=<ident> \$ $1:=<ident> [ \( $3:=<arglist> \) | $3:=<arglist> ]
}
Or, annotated:
rule method_call :w {
\$ # Match a literal $
$1:=<ident> # Match the varname, bind it to $1
-\> # Match a literal ->
$2:=<ident> # Match the method name, bind it to $2
\( # Match an opening paren
$3:=<arglist> # Match the arg list, bind it to $3
\) # Match a closing paren
| # Or
$2:=<ident> # Match the method name, bind it to $2
\$ # Match a literal $
$1:=<ident> # Match the varname, bind it to $1
[ # Either...
\( $3:=<arglist> \) # Match arg list in parens, bind it to $3
| # Or...
$3:=<arglist> # Just match arg list, bind it to $3
]
}
Now the rule's C<$1> is bound to the variable name, regardless of which
alternative matches. Likewise C<$2> is bound to the method name in
either branch of the C<|>, and C<$3> is associated with the argument
list, no matter which of the I<three> possible ways it was matched.
Of course, that's still rather ugly (especially if we have to write all those
comments just so others can understand how clever we were).
So an even better solution is just to use proper named rules (with their
handy auto-capturing behaviour) for everything. And then slice the required
information out of the result object's hash attribute:
rule varname { <ident> }
rule methodname { <ident> }
rule method_call :w {
\$ <varname> -\> <methodname> \( <arglist> \)
| <methodname> \$ <varname> [ \( <arglist> \) | <arglist> ]
}
$source_code =~ / <method_call> /;
my ($varname, $methodname, $arglist) =
$0{method_call}{"varname","methodname","arglist"}
[Update: Make that C<< $<method_call><varname methodname arglist> >> now.]
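Here is a sketch of that final version in present-day Raku. The assumptions are ours: the grammar name C<P5Call> is invented, whitespace is spelled out with C<\s> instead of relying on C<:w>, C<arglist> is crudely defined as "anything up to the closing paren", and the parenthesis-free form of the indirect syntax is left out.

```raku
grammar P5Call {
    token TOP         { <method_call> }
    token method_call {
        || '$' <varname> '->' <methodname> '(' <arglist> ')'           # $var->meth(...)
        || <methodname> \s+ '$' <varname> \s* '(' <arglist> ')'        # meth $var (...)
    }
    token varname     { <ident> }
    token methodname  { <ident> }
    token arglist     { <-[\)]>* }
}

my @seen;
for '$obj->frob(1, 2)', 'frob $obj (1, 2)' -> $source {
    my $m = P5Call.parse($source);
    # Slice all three named captures out of the nested result in one go.
    my ($varname, $methodname, $arglist) =
        $m<method_call><varname methodname arglist>.map(*.Str);
    @seen.push: "$varname / $methodname / $arglist";
}

say @seen.join("\n");
```

Both call styles yield the same three named pieces, with no bookkeeping about which numbered capture ended up where.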
=head1 Deriving a benefit
As the above examples illustrate, using named rules in grammars provides
a cleaner syntax and a reduction in the number of variables required in a
parsing program. But, beyond those advantages, and the obvious benefits
of moving rule construction from run-time to compile-time, there's yet
another significant way to gain from placing named rules inside a grammar: we
can I<inherit> from them.
For example, the ReverseDiff grammar is almost the same as the
normal Diff grammar. The only difference is in the C<hunk> rule.
So there's no reason why we shouldn't just have ReverseDiff inherit
all that sameness, and simply redefine its notion of C<hunk>-iness.
That would look like this:
grammar ReverseDiff is Diff {
rule hunk :i {
[ <linenum> a :: <linerange> \n
<appendline>+
{ $appendline =~ s/ <in_marker> /< /;
let $0 := "${linerange}d${linenum}\n"
_ join "", @$appendline;
}
|
<linerange> d :: <linenum> \n
<deleteline>+
{ $deleteline =~ s/ <out_marker> /> /;
let $0 := "${linenum}a${linerange}\n"
_ join "", @$deleteline;
}
|
$from:=<linerange> c :: $to:=<linerange> \n
<deleteline>+
--- \n
<appendline>+
{ $appendline =~ s/ <in_marker> /< /;
$deleteline =~ s/ <out_marker> /> /;
let $0 := "${to}c${from}\n"
_ join("", @$appendline)
_ "---\n"
_ join("", @$deleteline);
}
]
|
<badline("Invalid diff hunk")>
}
}
The C<ReverseDiff is Diff> syntax is the standard Perl 6 way of
inheriting behaviour. Classes will use the same notation:
class Hacker is Programmer {...}
class JAPH is Hacker {...}
# etc.
Likewise, in the above example Diff is specified as the base grammar
from which the new ReverseDiff grammar is derived. As a result of that
inheritance relationship, ReverseDiff immediately inherits all of the
Diff grammar's rules. We then simply redefine ReverseDiff's version of
the C<hunk> rule, and the job's done.
=head1 Different diffs
Grammatical inheritance isn't only useful for tweaking the behaviour of
a grammar's rules. It's also handy when two or more related grammars
share some characteristics, but differ in some particulars. For example,
suppose we wanted to support the "unified" diff format, as well as the
"classic".
A unified diff consists of two lines of header information, followed by
a series of hunks. The header information indicates the name and
modification date of the old file (prefixing the line with three minus
signs), and then the name and modification date of the new file
(prefixing that line with three plus signs). Each hunk consists of an
offset line, followed by one or more lines representing either shared
context, or a line to be inserted, or a line to be deleted. Offset lines
start with two "at" signs, then consist of a minus sign followed by the
old line offset and line-count, and then a plus sign followed by the
new line offset and line-count, and then two more "at" signs. Context
lines are prefixed with two spaces. Insertion lines are prefixed with a
plus sign and a space. Deletion lines are prefixed with a minus sign
and a space.
But that's not important right now.
What I<is> important is that we could write another complete grammar for
that, like so:
grammar Diff::Unified {
rule file { ^ <fileinfo> <hunk>* $ }
rule fileinfo {
<out_marker><3> $oldfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
[Update: The generalized repetition is now C<**{$n..$m}>, so the C<< <3> >>
would now be written C<**{3}>.]
<in_marker><3> $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
}
rule hunk {
<header>
@spec := ( <contextline>
| <appendline>
| <deleteline>
| <badline("Invalid line for unified diff")>
)*
}
rule header {
\@\@ <out_marker> <linenum> , <linecount> \h+
<in_marker> <linenum> , <linecount> \h+
\@\@ \h* \n
}
rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
rule linenum { (\d+) }
rule linecount { (\d+) }
rule deleteline { ^^ <out_marker> (\N* \n) }
rule appendline { ^^ <in_marker> (\N* \n) }
rule contextline { ^^ <sp> <sp> (\N* \n) }
rule out_marker { - <sp> }
rule in_marker { \+ <sp> }
}
That represents (and can parse) the new diff format correctly,
but it's a needless duplication of effort and code. Many of the rules of
this grammar are identical to those of the original diff parser. Which
suggests we could just grab them straight from the original -- by
inheriting them:
grammar Diff::Unified is Diff {
rule file { ^ <fileinfo> <hunk>* $ }
rule fileinfo {
<out_marker><3> $oldfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
<in_marker><3> $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
}
rule hunk {
<header>
@spec := ( <contextline>
| <appendline>
| <deleteline>
| <badline("Invalid line for unified diff")>
)*
}
rule header {
\@\@ <out_marker> <linenum> , <linecount> \h+
<in_marker> <linenum> , <linecount> \h+
\@\@ \h* \n
}
rule linecount { (\d+) }
rule contextline { ^^ <sp> <sp> (\N* \n) }
rule out_marker { - <sp> }
rule in_marker { \+ <sp> }
}
Note that in this version we don't need to specify the rules for
C<appendline>, C<deleteline>, C<linenum>, etc. They're provided
automagically by inheriting from the C<Diff> grammar. So we only have to
specify the parts of the new grammar that differ from the original.
In particular, this is where we finally reap the reward for factoring
out the C<in_marker> and C<out_marker> rules. Because we did that
earlier, we can now just change the rules for matching those two markers
directly in the new grammar. As a result, the inherited C<appendline> and
C<deleteline> rules (which use C<in_marker> and C<out_marker> as
subrules) will now attempt to match the new versions of the C<in_marker>
and C<out_marker> rules instead.
And if you're thinking that looks suspiciously like polymorphism, you're
absolutely right. The parallels between pattern matching and OO run
I<very> deep in Perl 6.
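For readers who want to poke at that idea today, here is a rough Perl 5 sketch of the same trick. It is only an analogy, not how Perl 6 grammars are implemented: the C<MiniDiff> and C<MiniDiff::Unified> packages are hypothetical names invented for this illustration, each "rule" is just a method returning a compiled pattern, and the classic markers are assumed to be C<< "< " >> and C<< "> " >>, as in traditional diff output.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a grammar: every "rule" is a method that
# returns a compiled pattern, so subrules are found by method dispatch.
package MiniDiff;
sub new        { my ($class) = @_; return bless {}, $class }
sub out_marker { return qr/< / }
sub in_marker  { return qr/> / }
sub deleteline { my ($self) = @_; my $m = $self->out_marker; return qr/^$m([^\n]*)\n/m }
sub appendline { my ($self) = @_; my $m = $self->in_marker;  return qr/^$m([^\n]*)\n/m }

# The derived "grammar" overrides only the two marker rules.
package MiniDiff::Unified;
our @ISA = ('MiniDiff');
sub out_marker { return qr/- / }
sub in_marker  { return qr/\+ / }

package main;

my $classic = MiniDiff->new;
my $unified = MiniDiff::Unified->new;

# The inherited deleteline/appendline pick up whichever markers
# belong to the object they are called on.
my ($old) = "< gone\n" =~ $classic->deleteline;
my ($new) = "+ here\n" =~ $unified->appendline;
print "classic deleted: $old\n";
print "unified added:   $new\n";
```

Neither C<deleteline> nor C<appendline> is redefined in the subclass, yet both match the new markers, because the marker lookup happens through the object at the time the pattern is built.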
=head1 Let's get cooking
To sum up: Perl 6 patterns and grammars extend Perl's text matching
capacities enormously. But you don't have to start using all that extra
power right away. You can ignore grammars and embedded closures and
assertions and the other sophisticated bits until you actually need them.
The new rule syntax also cleans up much of the "line-noise" of Perl 5
regexes. But the fundamentals don't change that much. Many Perl 5
patterns will translate very simply and naturally to Perl 6.
To demonstrate that, and to round out this exploration of Perl 6
patterns, here are a few common Perl 5 regexes -- some borrowed from
the I<Perl Cookbook>, and others from the Regexp::Common module --
all ported to equivalent Perl 6 rules:
=over
=item Match a C comment:
# Perl 5
$str =~ m{ /\* .*? \*/ }xs;
# Perl 6
$str =~ m{ /\* .*? \*/ };
[Update: Use C<~~> now.]
=item Remove leading qualifiers from a Perl identifier
# Perl 5
$ident =~ s/^(?:\w*::)*//;
# Perl 6
$ident =~ s/^[\w*\:\:]*//;
=item Warn of text with lines longer than 80 characters
# Perl 5
warn "Thar she blows!: $&"
if $str =~ m/.{81,}/;
# Perl 6
warn "Thar she blows!: $0"
if $str =~ m/\N<81,>/;
[Update:
warn "Thar she blows!: $()"
if $str ~~ m/\N**{81,}/;
]
=item Match a Roman numeral
# Perl 5
$str =~ m/ ^ m* (?:d?c{0,3}|c[dm]) (?:l?x{0,3}|x[lc]) (?:v?i{0,3}|i[vx]) $ /ix;
# Perl 6
$str =~ m:i/ ^ m* [d?c<0,3>|c<[dm]>] [l?x<0,3>|x<[lc]>] [v?i<0,3>|i<[vx]>] $ /;
[Update: C<**{0..3}> now.]
=item Extract lines regardless of line terminator
# Perl 5
push @lines, $1
while $str =~ m/\G([^\012\015]*)(?:\012\015?|\015\012?)/gc;
# Perl 6
push @lines, $1
while $str =~ m:c/ (\N*) \n /;
=item Match a quote-delimited string (Friedl-style), capturing contents:
# Perl 5
$str =~ m/ " ( [^\\"]* (?: \\. [^\\"]* )* ) " /x;
# Perl 6
$str =~ m/ " ( <-[\\"]>* [ \\. <-[\\"]>* ]* ) " /;
=item Match a decimal IPv4 address:
# Perl 5
my $quad = qr/(?: 25[0-5] | 2[0-4]\d | [0-1]??\d{1,2} )/x;
$str =~ m/ $quad \. $quad \. $quad \. $quad /x;
# Perl 6
rule quad { (\d<1,3>) :: { fail unless $1 < 256 } }
$str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
# Perl 6 (same great approach, now less syntax)
rule quad { (\d<1,3>) :: <($1 < 256)> }
$str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
[Update: Now written:
token quad { (\d**{1..3}) <?{ $1 < 256 }> }
$str ~~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;
]
=item Match a floating-point number, returning components:
# Perl 5
($sign, $mantissa, $exponent) =
$str =~ m/([+-]?)([0-9]+\.?[0-9]*|\.[0-9]+)(?:e([+-]?[0-9]+))?/;
# Perl 6
($sign, $mantissa, $exponent) =
$str =~ m/(<[+-]>?)(<[0-9]>+\.?<[0-9]>*|\.<[0-9]>+)[e(<[+-]>?<[0-9]>+)]?/;
=item Match a floating-point number I<maintainably>, returning components:
# Perl 5
my $digit = qr/[0-9]/;
my $sign_pat = qr/(?: [+-]? )/x;
my $mant_pat = qr/(?: $digit+ \.? $digit* | \. $digit+ )/x;
my $expo_pat = qr/(?: $sign_pat $digit+ )? /x;
($sign, $mantissa, $exponent) =
$str =~ m/ ($sign_pat) ($mant_pat) (?: e ($expo_pat) )? /x;
# Perl 6
rule sign { <[+-]>? }
rule mantissa { <digit>+ [\. <digit>*]? | \. <digit>+ }
rule exponent { [ <sign> <digit>+ ]? }
($sign, $mantissa, $exponent) =
$str =~ m/ (<sign>) (<mantissa>) [e (<exponent>)]? /;
=item Match nested parentheses:
# Perl 5
our $parens = qr/ \( (?: (?>[^()]+) | (??{$parens}) )* \) /x;
$str =~ m/$parens/;
# Perl 6
$str =~ m/ \( [ <-[()]> + : | <self> ]* \) /;
=item Match nested parentheses I<maintainably>:
# Perl 5
our $parens = qr/
\( # Match a literal '('
(?: # Start a non-capturing group
(?> # Never backtrack through...
[^()] + # Match a non-paren (repeatedly)
) # End of non-backtracking region
| # Or
(??{$parens}) # Recursively match entire pattern
)* # Close group and match repeatedly
\) # Match a literal ')'
/x;
$str =~ m/$parens/;
# Perl 6
$str =~ m/ <'('> # Match a literal '('
[ # Start a non-capturing group
<-[()]> + # Match a non-paren (repeatedly)
: # ...and never backtrack that match
| # Or
<self> # Recursively match entire pattern
]* # Close group and match repeatedly
<')'> # Match a literal ')'
/;
=back
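If you'd like to convince yourself that the Perl 5 halves of those pairs behave as advertised before porting them, a small harness along the following lines may help. It is only a sketch: the patterns are copied from the entries above, but the test strings are made up for illustration, and the C<^> and C<$> anchors around the IPv4 pattern are an added assumption so that the whole string has to be an address.

```perl
use strict;
use warnings;

# Roman numeral pattern, copied from the Perl 5 entry above.
my $roman = qr/ ^ m* (?:d?c{0,3}|c[dm]) (?:l?x{0,3}|x[lc]) (?:v?i{0,3}|i[vx]) $ /ix;

# IPv4 quad from above; the anchors in $ipv4 are an addition (an
# assumption of this sketch) so that partial matches are rejected.
my $quad = qr/(?: 25[0-5] | 2[0-4]\d | [0-1]??\d{1,2} )/x;
my $ipv4 = qr/ ^ $quad \. $quad \. $quad \. $quad $ /x;

# Floating-point pattern from above, capturing sign, mantissa, exponent.
my $float = qr/([+-]?)([0-9]+\.?[0-9]*|\.[0-9]+)(?:e([+-]?[0-9]+))?/;

for my $candidate (qw(MCMXCIV xiv IIII)) {
    printf "%-12s %s\n", $candidate,
        ($candidate =~ $roman ? "roman" : "not roman");
}

for my $candidate (qw(192.168.0.1 256.1.1.1)) {
    printf "%-12s %s\n", $candidate,
        ($candidate =~ $ipv4 ? "ipv4" : "not ipv4");
}

my ($sign, $mantissa, $exponent) = "-1.5e10" =~ $float;
print "sign=$sign mantissa=$mantissa exponent=$exponent\n";
```

Without the added anchors, the IPv4 pattern would happily find C<56.1.1.1> inside C<256.1.1.1>, which is one reason the Perl 6 versions above check the numeric value of each quad instead.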