15 Apr 2012 22:44:37 UTC
Marvin Humphrey <marvin at rectangular dot com>
NAME
KinoSearch::Analysis::Tokenizer - Split a string into tokens.
DEPRECATED
The KinoSearch code base has been assimilated by the Apache Lucy project. The "KinoSearch" namespace has been deprecated, but development continues under our new name at our new home: http://lucy.apache.org/
SYNOPSIS
    my $whitespace_tokenizer
        = KinoSearch::Analysis::Tokenizer->new( pattern => '\S+' );

    # or...
    my $word_char_tokenizer
        = KinoSearch::Analysis::Tokenizer->new( pattern => '\w+' );

    # or...
    my $apostrophising_tokenizer = KinoSearch::Analysis::Tokenizer->new;

    # Then... once you have a tokenizer, put it into a PolyAnalyzer:
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $word_char_tokenizer, $stemmer ], );
DESCRIPTION
Generically, "tokenizing" is a process of breaking up a string into an array of "tokens". For instance, the string "three blind mice" might be tokenized into "three", "blind", "mice".
KinoSearch::Analysis::Tokenizer decides where it should break up the text based on a regular expression compiled from a supplied pattern matching one token. If our source string is...

    "Eats, Shoots and Leaves."

... then a "whitespace tokenizer" with a pattern of "\\S+" produces...

    Eats,
    Shoots
    and
    Leaves.

... while a "word character tokenizer" with a pattern of "\\w+" produces...

    Eats
    Shoots
    and
    Leaves

... the difference being that the word character tokenizer skips over punctuation as well as whitespace when determining token boundaries.
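The difference between the two patterns can be reproduced in plain Perl, without KinoSearch installed. The following is a minimal sketch: the helper match_tokens is hypothetical and is not part of the KinoSearch API. It only approximates tokenizing by collecting every non-overlapping match of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch): return every
# non-overlapping match of $pattern in $text, in order.
sub match_tokens {
    my ( $pattern, $text ) = @_;
    my $regex = qr/$pattern/;
    my @tokens;
    while ( $text =~ /$regex/g ) {
        # @- and @+ hold the start and end offsets of the whole match.
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

my $source = 'Eats, Shoots and Leaves.';

my @whitespace_tokens = match_tokens( '\S+', $source );
my @word_char_tokens  = match_tokens( '\w+', $source );

print join( '|', @whitespace_tokens ), "\n";    # Eats,|Shoots|and|Leaves.
print join( '|', @word_char_tokens ),  "\n";    # Eats|Shoots|and|Leaves
```

The whitespace pattern keeps the comma and the full stop attached to their words, while the word character pattern drops them.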
CONSTRUCTORS
new( [labeled params] )
    my $word_char_tokenizer = KinoSearch::Analysis::Tokenizer->new(
        pattern => '\w+',    # optional; a default pattern is used if omitted
    );
pattern - A string specifying a Perl-syntax regular expression which should match one token. The default value is \w+(?:[\x{2019}']\w+)*, which matches "it's" as well as "it" and "O'Henry's" as well as "Henry".
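To see what the default pattern does with apostrophes, it can be compared with a bare \w+ in plain Perl. This is a sketch under the same assumption as any standalone example: match_tokens is a hypothetical helper, not a KinoSearch function, and the sample sentence is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of KinoSearch): return every
# non-overlapping match of $regex in $text, in order.
sub match_tokens {
    my ( $regex, $text ) = @_;
    my @tokens;
    while ( $text =~ /$regex/g ) {
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

# The default pattern quoted above: word characters, optionally followed
# by groups of (apostrophe or U+2019) plus more word characters.
my $default_pattern = qr/\w+(?:[\x{2019}']\w+)*/;
my $word_pattern    = qr/\w+/;

my $text = "It's O'Henry's story";

my @apostrophe_tokens = match_tokens( $default_pattern, $text );
my @plain_tokens      = match_tokens( $word_pattern,    $text );

print join( '|', @apostrophe_tokens ), "\n";    # It's|O'Henry's|story
print join( '|', @plain_tokens ),      "\n";    # It|s|O|Henry|s|story
```

With the default pattern, contractions and possessives survive as single tokens; with \w+ alone, each apostrophe splits the word in two.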
INHERITANCE
KinoSearch::Analysis::Tokenizer isa KinoSearch::Analysis::Analyzer isa KinoSearch::Object::Obj.
COPYRIGHT AND LICENSE
Copyright 2005-2011 Marvin Humphrey
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.