15 Apr 2012 22:44:37 UTC
- Distribution: KinoSearch
- License: perl_5
- CREAMYG Marvin Humphrey and 1 contributor
- Marvin Humphrey <marvin at rectangular dot com>
KinoSearch::Analysis::PolyAnalyzer - Multiple Analyzers in series.
The KinoSearch code base has been assimilated by the Apache Lucy project. The "KinoSearch" namespace has been deprecated, but development continues under our new name at our new home: http://lucy.apache.org/
    my $schema = KinoSearch::Plan::Schema->new;
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        language => 'en',
    );
    my $type = KinoSearch::Plan::FullTextType->new(
        analyzer => $polyanalyzer,
    );
    $schema->spec_field( name => 'title',   type => $type );
    $schema->spec_field( name => 'content', type => $type );
A PolyAnalyzer is a series of Analyzers, each of which will be called upon to "analyze" text in turn. You can either provide the Analyzers yourself, or you can specify a supported language, in which case a PolyAnalyzer consisting of a CaseFolder, a Tokenizer, and a Stemmer will be generated for you.
Supported languages:

    en => English,
    da => Danish,
    de => German,
    es => Spanish,
    fi => Finnish,
    fr => French,
    hu => Hungarian,
    it => Italian,
    nl => Dutch,
    no => Norwegian,
    pt => Portuguese,
    ro => Romanian,
    ru => Russian,
    sv => Swedish,
    tr => Turkish,
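To make the idea of analyzers running "in series" concrete, here is a minimal pure-Perl sketch. It does not use KinoSearch and is not how the library is implemented; the three closures and the analyze_in_series() helper are hypothetical stand-ins for a CaseFolder, a Tokenizer, and a Stemmer, and the toy stemmer only strips a trailing "s".

```perl
use strict;
use warnings;

# Hypothetical stand-ins for CaseFolder, Tokenizer, and Stemmer.
# Each stage takes a list of strings and returns a list of strings.
my $case_fold = sub { map { lc $_ } @_ };
my $tokenize  = sub { map { /(\w+)/g } @_ };

# Toy stemmer: strips one trailing "s". Real stemming is far richer.
my $stem = sub { map { my $t = $_; $t =~ s/s$//; $t } @_ };

# Run the stages in order, feeding each stage's output to the next.
sub analyze_in_series {
    my ( $stages, @items ) = @_;
    @items = $_->(@items) for @$stages;
    return @items;
}

my @terms = analyze_in_series(
    [ $case_fold, $tokenize, $stem ],
    'Cats Chase Mice',
);
print join( ' ', @terms ), "\n";    # prints "cat chase mice"
```

Each stage sees only the output of the stage before it, which is why the order of the stages changes the result.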
    my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        language => 'es',
    );

    # or...

    my $case_folder  = KinoSearch::Analysis::CaseFolder->new;
    my $tokenizer    = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer      = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer, $stemmer, ],
    );
language - An ISO code from the list of supported languages.
analyzers - An array of Analyzers. The order of the analyzers matters. Don't put a Stemmer before a Tokenizer (can't stem whole documents or paragraphs -- just individual words), or a Stopalizer after a Stemmer (stemmed words, e.g. "themselv", will not appear in a stoplist). In general, the sequence should be: normalize, tokenize, stopalize, stem.
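The ordering advice above can be illustrated with a small pure-Perl sketch. It does not call KinoSearch; the stoplist contents, the stopalize() and stem() helpers, and the toy stemming rule (strip a trailing "es" or "s") are all assumptions chosen so that "themselves" becomes "themselv", as in the example above.

```perl
use strict;
use warnings;

# Hypothetical stoplist and toy stemmer, for illustration only.
my %stoplist = map { $_ => 1 } qw( the themselves );

# Drop any token that appears in the stoplist.
sub stopalize { grep { !$stoplist{$_} } @_ }

# Toy stemmer: strips a trailing "es" or "s".
sub stem { map { my $t = $_; $t =~ s/e?s$//; $t } @_ }

my @tokens = qw( the cats wash themselves );

my @good = stem( stopalize(@tokens) );    # stopalize, then stem
my @bad  = stopalize( stem(@tokens) );    # stem, then stopalize

print "stopalize then stem: @good\n";     # cat wash
print "stem then stopalize: @bad\n";      # cat wash themselv
```

In the second ordering the stemmed form "themselv" no longer matches the stoplist entry "themselves", so it slips through into the output.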
get_analyzers() - Getter for the "analyzers" member.
Copyright 2005-2011 Marvin Humphrey
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.