
The norm_decoder caches the 256 possible byte => float pairs, obviating the need to call decode_norm over and over for a scoring implementation that knows how to use it.
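The caching idea can be sketched in pure Perl. Here, decode_byte() is a hypothetical stand-in for the real decode_norm function, and its decoding formula is a placeholder, not KinoSearch's actual one:

```perl
# Illustrative only: sketch of the norm_decoder caching strategy.
# decode_byte() is a placeholder for decode_norm -- it maps a
# byte (0-255) to a float.
sub decode_byte {
    my $byte = shift;
    return $byte / 255;    # placeholder formula, not KinoSearch's
}

# Decode each of the 256 possible bytes once, up front...
my @norm_table = map { decode_byte($_) } 0 .. 255;

# ...so scoring code pays only a cheap array lookup per posting.
my $norm = $norm_table[128];
```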


KinoSearch::Search::Similarity - Calculate how closely two things match.


    # ./
    package MySimilarity;

    sub length_norm {
        my ( $self, $num_tokens ) = @_;
        return $num_tokens == 0 ? 1 : log($num_tokens) + 1;
    }

    # ./
    package MySchema;
    use base qw( KinoSearch::Schema );
    use MySimilarity;
    sub similarity { MySimilarity->new }


KinoSearch uses a close approximation of boolean logic for determining which documents match a given query; then it uses a variant of the vector-space model for calculating scores. Much of the math used when calculating these scores is encapsulated within the Similarity class.

Similarity objects are used internally by KinoSearch's indexing and scoring classes. They are assigned via KinoSearch::Schema and KinoSearch::Schema::FieldSpec.

Only one method is publicly exposed at present.


To build your own Similarity implementation, override length_norm() in a subclass; Similarity's internal constructor will be inherited properly.

Similarity is implemented as a C-struct object, so you can't add any member variables to it.
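Since the object itself cannot carry extra state, one workaround is to keep tuning knobs at the package level instead. The class name SteepSimilarity and the $EXPONENT variable below are invented for illustration; this is a sketch, not an official API:

```perl
package SteepSimilarity;
# -norequire lets the example compile without KinoSearch installed;
# in real code a plain "use base" works once KinoSearch is loaded.
use parent -norequire, 'KinoSearch::Search::Similarity';

# No instance members available, so parameterize at package level.
our $EXPONENT = 0.75;

sub length_norm {
    my ( $self, $num_tokens ) = @_;
    return 1 if $num_tokens == 0;
    return $num_tokens**-$EXPONENT;    # steeper falloff than 1/sqrt(n)
}
```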



    my $multiplier = $sim->length_norm($num_tokens);

After a field is broken up into terms at index-time, each term must be assigned a weight. One of the factors in calculating this weight is the number of tokens that the original field was broken into.

Typically, we assume that the more tokens in a field, the less important any one of them is -- so that, e.g. 5 mentions of "Kafka" in a short article are given more heft than 5 mentions of "Kafka" in an entire book. The default implementation of length_norm expresses this using an inverted square root.
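Taking the default to be the inverted square root, 1/sqrt(num_tokens), as described above, the Kafka example plays out like this. The function name default_length_norm and the token counts are illustrative, not part of KinoSearch's API:

```perl
# Default-style length norm: shorter fields yield larger multipliers.
sub default_length_norm {
    my $num_tokens = shift;
    return $num_tokens == 0 ? 1 : 1 / sqrt($num_tokens);
}

my $short_article = default_length_norm(200);      # ~0.0707
my $whole_book    = default_length_norm(200_000);  # ~0.0022

# Five "Kafka" hits in the article outweigh five in the book,
# because each occurrence is scaled by a larger multiplier.
```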

However, the inverted square root has a tendency to reward very short fields highly, which isn't always appropriate for fields you expect to have a lot of tokens on average. See KSx::Search::LongFieldSim for a discussion.


Copyright 2005-2007 Marvin Humphrey


This documentation covers KinoSearch version 0.20.