-
-
04 Sep 2014 00:12:31 UTC
- Distribution: Lucy
- Source (raw)
- Browse (raw)
- Changes
- Homepage
- How to Contribute
- Clone repository
- Issues
- Testers (2 / 91 / 0)
- Kwalitee
Bus factor: 1- License: apache_2_0
- Perl: v5.8.3
- Activity
24 month- Tools
- Download (1.06MB)
- MetaCPAN Explorer
- Permissions
- Subscribe to distribution
- Permalinks
- This version
- Latest version
and 1 contributors- The Apache Lucy Project <dev at lucy dot apache dot org>
- Dependencies
- Clownfish
- and possibly others
- Reverse dependencies
- CPAN Testers List
- Dependency graph
NAME
Lucy::Docs::Tutorial::Analysis - How to choose and use Analyzers.
DESCRIPTION
Try swapping out the EasyAnalyzer in our Schema for a StandardTokenizer:
my $tokenizer = Lucy::Analysis::StandardTokenizer->new; my $type = Lucy::Plan::FullTextType->new( analyzer => $tokenizer, );
Search for
senate
,Senate
, andSenator
before and after making the change and re-indexing.Under EasyAnalyzer, the results are identical for all three searches, but under StandardTokenizer, searches are case-sensitive, and the result sets for
Senate
andSenator
are distinct.EasyAnalyzer
What's happening is that EasyAnalyzer is performing more aggressive processing than StandardTokenizer. In addition to tokenizing, it's also converting all text to lower case so that searches are case-insensitive, and using a "stemming" algorithm to reduce related words to a common stem (
senat
, in this case).EasyAnalyzer is actually multiple Analyzers wrapped up in a single package. In this case, it's three-in-one, since specifying a EasyAnalyzer with
language => 'en'
is equivalent to this snippet:my $tokenizer = Lucy::Analysis::StandardTokenizer->new; my $normalizer = Lucy::Analysis::Normalizer->new; my $stemmer = Lucy::Analysis::SnowballStemmer->new( language => 'en' ); my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( analyzers => [ $tokenizer, $normalizer, $stemmer ], );
You can add or subtract Analyzers from there if you like. Try adding a fourth Analyzer, a SnowballStopFilter for suppressing "stopwords" like "the", "if", and "maybe".
my $stopfilter = Lucy::Analysis::SnowballStopFilter->new( language => 'en', ); my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( analyzers => [ $tokenizer, $normalizer, $stopfilter, $stemmer ], );
Also, try removing the SnowballStemmer.
my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new( analyzers => [ $tokenizer, $normalizer ], );
The original choice of a stock English EasyAnalyzer probably still yields the best results for this document collection, but you get the idea: sometimes you want a different Analyzer.
When the best Analyzer is no Analyzer
Sometimes you don't want an Analyzer at all. That was true for our "url" field because we didn't need it to be searchable, but it's also true for certain types of searchable fields. For instance, "category" fields are often set up to match exactly or not at all, as are fields like "last_name" (because you may not want to conflate results for "Humphrey" and "Humphries").
To specify that there should be no analysis performed at all, use StringType:
my $type = Lucy::Plan::StringType->new; $schema->spec_field( name => 'category', type => $type );
Highlighting up next
In our next tutorial chapter, Lucy::Docs::Tutorial::Highlighter, we'll add highlighted excerpts from the "content" field to our search results.
Module Install Instructions
To install Lucy::Simple, copy and paste the appropriate command in to your terminal.
cpanm Lucy::Simple
perl -MCPAN -e shell install Lucy::Simple
For more information on module installation, please visit the detailed CPAN module installation guide.