27 Oct 2007 19:46:10 UTC
- Development release
- Distribution: KinoSearch
Contributor: Marvin Humphrey <marvin at rectangular dot com>
NAME
KinoSearch::Analysis::Token - Unit of text.
SYNOPSIS
    my $token = KinoSearch::Analysis::Token->new(
        text         => 'horses',
        start_offset => 0,
        end_offset   => 6,
    );
    $token->set_text('hors');
DESCRIPTION
Token is the fundamental unit used by KinoSearch's Analyzer subclasses. Each Token has 5 attributes:
text - a UTF-8 string.
start_offset - The start point of the token text, measured in UTF-8 characters from the top of the stored field. start_offset and end_offset locate the Token within a larger context, even if the Token's text attribute gets modified -- by stemming, for instance. The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is "beat", but the start_offset is still 0 and the end_offset is still 7. This allows "beating" to be highlighted correctly after a search matches "beat".
end_offset - The end of the token text, measured in UTF-8 characters from the top of the field.
boost - a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document, for example. (Note: The field this token belongs to must be spec'd to use a posting_type of KinoSearch::Posting::RichPosting.)
pos_inc - POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for "blind" to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 -- and will no longer produce a phrase match for the query "three blind mice".
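The position numbers quoted for pos_inc can be reproduced with a short sketch. It assumes that each token receives the current position first and that its pos_inc is added afterwards, which is what the figures 0, 1, and 1001 imply; the source does not spell out the actual indexing code. The helper assign_positions is hypothetical and not part of KinoSearch, and plain hashes stand in for Token objects so that the sketch runs without the distribution installed.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch. Assumption: a token is
# given the current position, then the counter advances by its pos_inc.
sub assign_positions {
    my @tokens = @_;
    my $pos    = 0;
    my @positions;
    for my $token (@tokens) {
        push @positions, $pos;
        $pos += $token->{pos_inc};
    }
    return @positions;
}

# Default increments: consecutive positions.
my @default = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1 },
    { text => 'mice',  pos_inc => 1 },
);

# A large increment on "blind" opens a gap before "mice".
my @gapped = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1000 },
    { text => 'mice',  pos_inc => 1 },
);

print "default: @default\n";    # default: 0 1 2
print "gapped:  @gapped\n";     # gapped:  0 1 1001
```

Under this assumption "blind" and "mice" are no longer at adjacent positions, which is why the phrase query stops matching.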
METHODS
new
    my $token = KinoSearch::Analysis::Token->new(
        text         => $text,            # required
        start_offset => 0,                # required
        end_offset   => length($text),    # required
        boost        => 100.0,            # default 1.0
        pos_inc      => 0,                # default 1
    );
Constructor. Takes hash-style parameters, corresponding to the token's attributes.
Accessors
Token provides these set/get methods:
- set_text
- get_text
- set_start_offset
- get_start_offset
- set_end_offset
- get_end_offset
- set_boost
- get_boost
- set_pos_inc
- get_pos_inc
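Because text and the offsets have separate accessors, the text can be rewritten while the offsets keep pointing at the original span. The following is a minimal sketch of the "beating" example from DESCRIPTION. It uses a plain hash as a stand-in for a Token so that it runs without KinoSearch installed; the regex "stemmer" and the <b> markup are illustrative assumptions, not what KinoSearch's own stemmer or highlighter do.

```perl
use strict;
use warnings;

my $field = 'beating a dead horse';

# Plain hash standing in for a Token; keys mirror the documented attributes.
my %token = ( text => 'beating', start_offset => 0, end_offset => 7 );

# Crude stand-in for a stemmer (assumption): strip a trailing "ing".
( my $stem = $token{text} ) =~ s/ing\z//;
$token{text} = $stem;    # analogous to $token->set_text('beat')

# The offsets are untouched, so they still locate "beating" in the field.
my $length   = $token{end_offset} - $token{start_offset};
my $original = substr( $field, $token{start_offset}, $length );

# Wrap the original span in illustrative markup.
my $highlighted = $field;
substr( $highlighted, $token{start_offset}, $length ) = "<b>$original</b>";

print "$token{text}\n";     # beat
print "$highlighted\n";     # <b>beating</b> a dead horse
```

With a real Token, the same values would be read through get_start_offset and get_end_offset after set_text had replaced the text.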
COPYRIGHT
Copyright 2006-2007 Marvin Humphrey
LICENSE, DISCLAIMER, BUGS, etc.
See KinoSearch version 0.20.