Text::TokenStream::Lexer - reusable lexer for token-stream scanning
    my $lexer = Text::TokenStream::Lexer->new(
        whitespace => [qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x],
        rules => [
            word => qr/\w+/,
            sym  => qr/[^\w\s\#]+/,
        ],
    );

    my $token = $lexer->next_token(\$input_text);
A lexer instance is constructed by specifying regexes that match individual parts of the input text. Each regex is associated with a token type that will be used to distinguish the tokens found. The regexes are tried in the order they're given in the "rules" attribute; this means, for example, that you can have a keyword rule that matches any of a list of specified keywords, followed by an identifier rule that matches arbitrary identifiers, even if keywords have the same syntax as identifiers.
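The effect of rule order can be sketched in plain Perl. This is only an illustration of "first matching rule wins", not the module's own code; the rule list, the keyword set, and the helper first_matching_rule are all hypothetical names chosen for the example.

```perl
use strict;
use warnings;

# Hypothetical rule list, laid out as (type, regex) pairs like "rules".
my @rules = (
    keyword    => qr/(?:if|else|while)\b/,
    identifier => qr/[A-Za-z_]\w*/,
);

# Try each rule in order at the start of $text; the first match wins.
sub first_matching_rule {
    my ($text) = @_;
    for (my $i = 0; $i < @rules; $i += 2) {
        my ($type, $re) = @rules[$i, $i + 1];
        return ($type, $1) if $text =~ /\A($re)/;
    }
    return;    # no rule matched
}

my @kw = first_matching_rule('while (x)');     # keyword rule is tried first
my @id = first_matching_rule('whilst (x)');    # falls through to identifier

print "$kw[0]: $kw[1]\n";
print "$id[0]: $id[1]\n";
```

Swapping the two pairs would make every keyword come back as an identifier, since the identifier pattern also matches keyword text.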
(In actual fact, the regexes are preprocessed into a form that the regex engine can handle more easily, and only one regex match operation is performed to extract each token. This should be completely transparent to the caller.)
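One way such preprocessing could work is to join the rules into a single alternation and tag each branch with Perl's (*MARK:name) verb, so that one match operation yields both the text and the token type. The sketch below is an assumption about the general technique, not a description of the module's internals; the rule list and the helper classify are hypothetical.

```perl
use strict;
use warnings;

# Package variable set by the regex engine after a successful match to the
# name of the most recently executed (*MARK:name) verb.
our $REGMARK;

# Hypothetical (type, regex) pairs, in priority order.
my @rules = (
    keyword    => qr/(?:if|else|while)\b/,
    identifier => qr/[A-Za-z_]\w*/,
    number     => qr/\d+/,
);

# Build one branch per rule; the MARK is only reached if the rule matched.
my @branches;
for (my $i = 0; $i < @rules; $i += 2) {
    my ($type, $re) = @rules[$i, $i + 1];
    push @branches, "(?:$re)(*MARK:$type)";
}
my $alternation = join '|', @branches;
my $combined    = qr/\A(?:$alternation)/;

# Return (type, matched text) for the token at the start of $text.
sub classify {
    my ($text) = @_;
    return unless $text =~ $combined;
    return ($REGMARK, substr($text, 0, $+[0]));
}

my ($type, $matched) = classify('whilst x');
print "$type: $matched\n";
```

Because alternation branches are tried left to right, the rule order is preserved even though only one regex is executed.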
A lexer will attempt to skip whitespace before scanning each token; to do that, it uses a separate set of regexes, in the "whitespace" attribute.
This class uses Moo, and inherits the standard new constructor.
rules
Required; read-only. Array ref of (identifier, rule) pairs. Each rule is a regex (or a literal string) that will be matched at the current position in the input; if the rule matches, the preceding identifier is used as the type of the token.
If a rule regex has any named captures, the contents of those captures will be preserved in the value returned by "next_token".
The regexes will be implicitly anchored to the next match position in the string being examined, so you should not add any initial anchor.
It is the caller's responsibility to ensure that the rules match every possible input.
whitespace
Read-only; defaults to empty array ref. Array ref of rules, where each rule is a regex (or literal string) that will be treated as whitespace. It is usually a good idea to include comment syntax in this attribute, if your language has comments.
next_token
Takes one argument, which is a reference to a string. First calls "skip_whitespace" on the referenced string, and returns undef if nothing remains after the whitespace has been skipped. Then attempts to match each of the "rules" against the remaining part of the string. If no rule matches, throws an exception. Otherwise, returns a hashref containing the following elements:
type
The identifier corresponding to the rule that matched
text
The text matched by the regex
cuddled
A boolean value, true iff the token was not preceded by whitespace
captures
A hashref of any named captures matched by the regex
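A typical caller would probably invoke next_token in a loop until it returns undef. The sketch below assumes that next_token removes the matched text from the referenced string, as "skip_whitespace" is documented to do for whitespace; the loop includes a guard in case that assumption does not hold. The rule names (string, word, sym) and the sample input are hypothetical.

```perl
use strict;
use warnings;
use Text::TokenStream::Lexer;

my $lexer = Text::TokenStream::Lexer->new(
    whitespace => [qr/\s+/],
    rules => [
        string => qr/"(?<body>[^"]*)"/,    # named capture: body
        word   => qr/\w+/,
        sym    => qr/[^\w\s"]+/,
    ],
);

my $input = 'say("hi") ;';
my @tokens;
while (length $input) {
    my $before = length $input;
    my $token  = $lexer->next_token(\$input);
    last unless defined $token;           # only whitespace was left
    push @tokens, $token;
    last if length($input) >= $before;    # guard: stop if nothing was consumed
}

for my $token (@tokens) {
    my $captures = $token->{captures} || {};
    printf "%-6s %-5s %-8s %s\n",
        $token->{type},
        $token->{text},
        ($token->{cuddled} ? 'cuddled' : 'spaced'),
        join(', ', map { "$_=$captures->{$_}" } sort keys %$captures);
}
```

The cuddled flag lets a caller tell adjacent tokens apart from the same tokens separated by whitespace, which matters when spacing is significant in the language being scanned.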
skip_whitespace
Takes one argument, which is a reference to a string. If none of the "whitespace" patterns match at the start of the referenced string, returns false. Otherwise, removes as many leading whitespace sequences as it can from the beginning of the referenced string, and returns true.
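The documented contract (strip every leading whitespace sequence in place, and report whether anything was stripped) can be sketched in plain Perl as follows. This is an illustration under stated assumptions rather than the module's code: the pattern list is copied from the synopsis, the helper skip_leading_whitespace is a hypothetical name, and the patterns are assumed never to match the empty string.

```perl
use strict;
use warnings;

# Whitespace patterns taken from the synopsis: blanks and "#" comments.
my @whitespace = (qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x);

# Remove leading whitespace sequences from the referenced string.
# Returns true if anything was removed, false otherwise.
sub skip_leading_whitespace {
    my ($str_ref) = @_;
    my $skipped  = 0;
    my $progress = 1;
    while ($progress) {
        $progress = 0;
        for my $re (@whitespace) {
            my $before = length $$str_ref;
            # Only count a substitution that actually shortened the string.
            if ($$str_ref =~ s/\A$re// && length($$str_ref) < $before) {
                $progress = $skipped = 1;
            }
        }
    }
    return $skipped;
}

my $text = "  # note\n\t foo";
my $skipped = skip_leading_whitespace(\$text);
print(($skipped ? 'skipped' : 'nothing skipped'), "; remaining: $text\n");
```

Repeating the scan until no pattern matches is what allows blank space and comments to be interleaved in any order before the next token.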
Aaron Crane, <arc@cpan.org>
Copyright 2021 Aaron Crane.
This library is free software and may be distributed under the same terms as perl itself. See http://dev.perl.org/licenses/.