Author: Marvin Humphrey

NAME

KinoSearch::Analysis::Token - Unit of text.

SYNOPSIS

    my $token = KinoSearch::Analysis::Token->new(
        text         => 'horses',
        start_offset => 0,
        end_offset   => 6,
    );
    $token->set_text('hors');

DESCRIPTION

Token is the fundamental unit used by KinoSearch's Analyzer subclasses. Each Token has 5 attributes:

  • text - a UTF-8 string.

  • start_offset - The start point of the token text, measured in UTF-8 characters from the top of the stored field. start_offset and end_offset locate the Token within a larger context, even if the Token's text attribute gets modified -- by stemming, for instance. The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is "beat", but the start_offset is still 0 and the end_offset is still 7. This allows "beating" to be highlighted correctly after a search matches "beat".

  • end_offset - The end of the token text, measured in UTF-8 characters from the top of the field.

  • boost - a per-token weight. Use this when you want to assign more or less importance to a particular token, as you might for emboldened text within an HTML document. (Note: The field this token belongs to must be spec'd to store_pos_boost.)

  • pos_inc - POSition INCrement, measured in Tokens. This attribute, which defaults to 1, is an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for "blind" to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 -- and will no longer produce a phrase match for the query '"three blind mice"'.
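
The position arithmetic described for pos_inc can be modelled in a few lines of plain Perl. This is only a sketch and does not call KinoSearch: assign_positions is a hypothetical helper, and it assumes that a token's pos_inc sets the gap between that token and the one following it, an assumption chosen because it reproduces the positions 0, 1, and 1001 given above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch. Takes a list of hashes with
# a pos_inc key and returns the position assigned to each one.
# Assumption: a token's pos_inc is the distance to the NEXT token.
sub assign_positions {
    my @tokens = @_;
    my $pos    = 0;
    my @positions;
    for my $token (@tokens) {
        push @positions, $pos;         # record this token's position
        $pos += $token->{pos_inc};     # then advance by its increment
    }
    return @positions;
}

# All increments left at the default of 1.
my @default = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1 },
    { text => 'mice',  pos_inc => 1 },
);

# "blind" given an increment of 1000.
my @gapped = assign_positions(
    { text => 'three', pos_inc => 1 },
    { text => 'blind', pos_inc => 1000 },
    { text => 'mice',  pos_inc => 1 },
);

print "default: @default\n";    # default: 0 1 2
print "gapped:  @gapped\n";     # gapped:  0 1 1001
```

Because "blind" and "mice" are no longer at adjacent positions in the second case, a phrase query that requires consecutive positions would not match.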

METHODS

new

    my $token = KinoSearch::Analysis::Token->new(
        text         => $text,          # required 
        start_offset => 0,              # required 
        end_offset   => length($text),  # required
        boost        => 100.0,          # default 1.0
        pos_inc      => 0,              # default 1
    );

Constructor. Takes hash-style parameters, corresponding to the token's attributes.
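
As a rough illustration of where the required parameters might come from, the sketch below splits a field into words and builds one hash of constructor-style arguments per word. It is plain Perl and does not load KinoSearch; token_args_for is a hypothetical helper, and splitting on \w+ is an assumption made for the example, not a description of how any Analyzer tokenizes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch. Returns a list of hashes
# whose keys match the constructor's required parameters.
sub token_args_for {
    my ($field) = @_;
    my @args;
    while ( $field =~ /(\w+)/g ) {
        push @args, {
            text         => $1,
            start_offset => $-[1],    # offset of the match's first character
            end_offset   => $+[1],    # offset just past its last character
        };
    }
    return @args;
}

my $field = 'beating a dead horse';
my @args  = token_args_for($field);

# Simulate stemming: the text changes, the offsets do not.
$args[0]{text} = 'beat';

# The untouched offsets still recover the original word for highlighting.
my $first    = $args[0];
my $original = substr(
    $field,
    $first->{start_offset},
    $first->{end_offset} - $first->{start_offset},
);
print "$first->{text} -> $original\n";    # beat -> beating
```

Each hash has the same keys as the required constructor parameters, so with KinoSearch loaded it could presumably be passed along as KinoSearch::Analysis::Token->new(%$first).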

Accessors

Token provides these set/get methods:

set_text
get_text
set_start_offset
get_start_offset
set_end_offset
get_end_offset
set_boost
get_boost
set_pos_inc
get_pos_inc

COPYRIGHT

Copyright 2006-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20.