# Auto-generated file -- DO NOT EDIT!!!!! =head1 NAME KinoSearch::Analysis::Tokenizer - Split a string into tokens. =head1 DEPRECATED The KinoSearch code base has been assimilated by the Apache L project. The "KinoSearch" namespace has been deprecated, but development continues under our new name at our new home: L =head1 SYNOPSIS my $whitespace_tokenizer = KinoSearch::Analysis::Tokenizer->new( pattern => '\S+' ); # or... my $word_char_tokenizer = KinoSearch::Analysis::Tokenizer->new( pattern => '\w+' ); # or... my $apostrophising_tokenizer = KinoSearch::Analysis::Tokenizer->new; # Then... once you have a tokenizer, put it into a PolyAnalyzer: my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new( analyzers => [ $case_folder, $word_char_tokenizer, $stemmer ], ); =head1 DESCRIPTION Generically, "tokenizing" is a process of breaking up a string into an array of "tokens". For instance, the string "three blind mice" might be tokenized into "three", "blind", "mice". KinoSearch::Analysis::Tokenizer decides where it should break up the text based on a regular expression compiled from a supplied C<< pattern >> matching one token. If our source string is... "Eats, Shoots and Leaves." ... then a "whitespace tokenizer" with a C<< pattern >> of C<< "\\S+" >> produces... Eats, Shoots and Leaves. ... while a "word character tokenizer" with a C<< pattern >> of C<< "\\w+" >> produces... Eats Shoots and Leaves ... the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries. =head1 CONSTRUCTORS =head2 new( I<[labeled params]> ) my $word_char_tokenizer = KinoSearch::Analysis::Tokenizer->new( pattern => '\w+', # required ); =over =item * B - A string specifying a Perl-syntax regular expression which should match one token. The default value is C<< \w+(?:[\x{2019}']\w+)* >>, which matches "it's" as well as "it" and "O'Henry's" as well as "Henry". =back =head1 INHERITANCE KinoSearch::Analysis::Tokenizer isa L isa L. =head1 COPYRIGHT AND LICENSE Copyright 2005-2011 Marvin Humphrey This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut