NAME

Text::TokenStream::Lexer - reusable lexer for token-stream scanning

SYNOPSIS

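    use Text::TokenStream::Lexer;

    my $lexer = Text::TokenStream::Lexer->new(
        # Skipped before each token: blank runs and "#" comments
        whitespace => [qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x],
        # (type, regex) pairs, tried in this order
        rules => [
            word => qr/\w+/,
            sym => qr/[^\w\s\#]+/,
        ],
    );

    # Pass a reference to the text being scanned; returns a hashref,
    # or undef if only whitespace remains
    my $token = $lexer->next_token(\$input_text);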

DESCRIPTION

A lexer instance is constructed by specifying regexes that match individual parts of the input text. Each regex is associated with a token type that is used to distinguish the tokens found. The regexes are tried in the order they're given in the "rules" attribute. This means, for example, that you can put a keyword rule (matching any of a list of specified keywords) before an identifier rule (matching arbitrary identifiers), even if keywords have the same syntax as identifiers.
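
The effect of rule order can be illustrated without the module. The following is a minimal sketch in core Perl, assuming a hypothetical helper first_matching_rule that tries each (type, regex) pair in turn; it is not the module's implementation, and the keyword list is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TokenStream::Lexer: try each
# (type, regex) pair in order at the start of $text; the first match wins.
sub first_matching_rule {
    my ($rules, $text) = @_;
    for (my $i = 0; $i < @$rules; $i += 2) {
        my ($type, $re) = @{$rules}[$i, $i + 1];
        return { type => $type, text => $1 } if $text =~ /\A($re)/;
    }
    return;
}

my @rules = (
    keyword    => qr/(?:if|else|while)\b/,   # must come first
    identifier => qr/[A-Za-z_]\w*/,          # would also match any keyword
);

for my $input ('while', 'whilst') {
    my $tok = first_matching_rule(\@rules, $input);
    print "$input => $tok->{type}\n";
}
# prints:
#   while => keyword
#   whilst => identifier
```

If the two rules were swapped, "while" would be reported as an identifier, because the identifier regex matches it too.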

(Internally, the regexes are preprocessed into a form that the regex engine can handle more easily, so only one regex match operation is performed to extract each token. This should be completely transparent to the caller.)
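
How the preprocessing works is not documented here. One plausible approach, shown purely as a sketch, is to wrap each rule in a generated named group and join the groups into a single alternation; Perl tries alternatives left to right, so rule order is preserved. The group names (t0, t1, ...) and the helper match_at are assumptions made for this example.

```perl
use strict;
use warnings;

my @rules = (word => qr/\w+/, sym => qr/[^\w\s\#]+/);

# Wrap each rule in a named group and remember which type it stands for.
my (@types, @branches);
for (my $i = 0; $i < @rules; $i += 2) {
    push @types, $rules[$i];
    push @branches, sprintf '(?<t%d>%s)', $#types, $rules[$i + 1];
}
my $alternation = join '|', @branches;
my $combined    = qr/\G(?:$alternation)/;

# Hypothetical helper: one match operation at offset $pos decides
# both whether a token is present and which rule produced it.
sub match_at {
    my ($text, $pos) = @_;
    pos($text) = $pos;
    return unless $text =~ /$combined/gc;
    my ($idx) = grep { defined $+{"t$_"} } 0 .. $#types;
    return { type => $types[$idx], text => $+{"t$idx"} };
}

my $first  = match_at('foo+bar', 0);   # type "word", text "foo"
my $second = match_at('foo+bar', 3);   # type "sym",  text "+"
```

A real implementation would also need to keep the generated group names from clashing with named captures in the caller's own rules.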

A lexer will attempt to skip whitespace before scanning each token; to do that, it uses a separate set of regexes, in the "whitespace" attribute.

CONSTRUCTOR

This class uses Moo, and inherits the standard new constructor.

ATTRIBUTES

rules

Required; read-only. Array ref of (identifier, rule) pairs: each rule is a regex (or a literal string) that will be matched at the current position in the input. If the rule matches, the preceding identifier is used as the type of the token.

If a rule regex has any named captures, the contents of those captures will be preserved in the value returned by "next_token".
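
Named captures are the standard Perl mechanism behind this: after a successful match they are available in the %+ hash. The sketch below uses a made-up name=value rule and a hypothetical helper match_with_captures to show the kind of data that can be preserved; the layout of the returned hashref is an assumption for the example only.

```perl
use strict;
use warnings;

# Illustrative rule with two named captures (not taken from the module).
my $rule = qr/(?<name>\w+)=(?<value>\d+)/;

# Hypothetical helper: match $rule at the start of $input and copy the
# named captures, since %+ is overwritten by the next successful match.
sub match_with_captures {
    my ($re, $input) = @_;
    return unless $input =~ /\A$re/;
    return {
        text     => substr($input, 0, $+[0]),   # $+[0] is the end of the match
        captures => { %+ },
    };
}

my $m = match_with_captures($rule, 'width=80 rest');
print "$m->{text}: name=$m->{captures}{name} value=$m->{captures}{value}\n";
# prints: width=80: name=width value=80
```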

The regexes will be implicitly anchored to the next match position in the string being examined, so you should not add any initial anchor.
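
To see why a leading anchor is unnecessary and can be harmful, here is a small sketch of position-anchored matching using Perl's \G assertion. Whether the module itself uses \G or some other mechanism is not stated here; match_here is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: match $rule exactly at offset $pos of $text, or fail.
sub match_here {
    my ($rule, $text, $pos) = @_;
    pos($text) = $pos;
    return $text =~ /\G($rule)/gc ? $1 : undef;
}

my $word = qr/\w+/;
my $at_word  = match_here($word, 'foo bar', 4);   # "bar"
my $at_space = match_here($word, 'foo bar', 3);   # undef: cannot skip ahead to "bar"

# A rule carrying its own "^" can only ever match at offset 0.
my $anchored = match_here(qr/^\w+/, 'foo bar', 4);   # undef
```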

It is the caller's responsibility to ensure that the rules match every possible input. If no rule matches at the current position, "next_token" throws an exception.

whitespace

Read-only; defaults to empty array ref. Array ref of rules, where each rule is a regex (or literal string) that will be treated as whitespace. It is usually a good idea to include comment syntax (if your language has comments) in this attribute.

The regexes will be implicitly anchored to the next match position in the string being examined, so you should not add any initial anchor.

OTHER METHODS

next_token

Takes one argument, which is a reference to a string. First attempts to "skip_whitespace" on the referenced string, and returns undef if the string is empty after any whitespace. Then attempts to match each of the "rules" against the remaining part of the string. If no rule matches, throws an exception. Otherwise, returns a hashref containing the following elements:

type

The identifier corresponding to the rule that matched

text

The text matched by the regex

cuddled

A boolean value, true if and only if the token was not preceded by whitespace

captures

A hashref of any named captures matched by the regex
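
A typical caller loops until undef is returned. The sketch below shows that loop and the shape of the returned hashref using a core-Perl stand-in, sketch_next_token, with two hard-coded rules. It assumes that matched text is removed from the referenced string and that a token at the very start of the input counts as cuddled; neither point is confirmed by this document.

```perl
use strict;
use warnings;

# Hypothetical stand-in for next_token; not the module's code.
sub sketch_next_token {
    my ($str_ref) = @_;
    my $skipped = $$str_ref =~ s/\A\s+//;      # true if whitespace was removed
    return if $$str_ref eq '';                 # nothing left: undef
    for my $rule ([word => qr/\w+/], [sym => qr/[^\w\s]+/]) {
        my ($type, $re) = @$rule;
        next unless $$str_ref =~ s/\A($re)//;  # consume the token text
        return {
            type     => $type,
            text     => $1,
            cuddled  => !$skipped,
            captures => { %+ },
        };
    }
    die "No rule matches at: $$str_ref";
}

my $input = 'foo +bar';
my @seen;
while (my $tok = sketch_next_token(\$input)) {
    push @seen, sprintf '%s:%s:%s', $tok->{type}, $tok->{text},
        $tok->{cuddled} ? 'cuddled' : 'spaced';
}
print join(' ', @seen), "\n";
# prints: word:foo:cuddled sym:+:spaced word:bar:cuddled
```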

skip_whitespace

Takes one argument, which is a reference to a string. If none of the "whitespace" patterns match at the start of the referenced string, returns false. Otherwise, removes as many leading whitespace sequences as it can from the beginning of the referenced string, and returns true.
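
The described behaviour can be approximated in core Perl as follows. This is a sketch only: sketch_skip_whitespace is a hypothetical stand-in that uses the whitespace patterns from the SYNOPSIS and stops if a pattern matches the empty string, so that the loop always terminates.

```perl
use strict;
use warnings;

# Whitespace patterns from the SYNOPSIS: blank runs and "#" comments.
my @whitespace = (qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x);

# Hypothetical stand-in for skip_whitespace; not the module's code.
sub sketch_skip_whitespace {
    my ($str_ref) = @_;
    my $alt = join '|', @whitespace;
    my $removed = 0;
    while ($$str_ref =~ s/\A((?:$alt))//) {
        last unless length $1;   # guard against zero-length matches
        $removed = 1;
    }
    return $removed;
}

my $text = "  # note\n  foo";
my $skipped = sketch_skip_whitespace(\$text);   # true; $text is now "foo"

my $plain = 'foo';
my $none = sketch_skip_whitespace(\$plain);     # false; $plain is unchanged
```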

AUTHOR

Aaron Crane, <arc@cpan.org>

COPYRIGHT

Copyright 2021 Aaron Crane.

LICENCE

This library is free software and may be distributed under the same terms as perl itself. See http://dev.perl.org/licenses/.