Marvin Humphrey

NAME

KinoSearch::Index::IndexReader - Read from an inverted index.

SYNOPSIS

    my $reader = KinoSearch::Index::IndexReader->open(
        invindex => MySchema->read('/path/to/invindex'),
    );

DESCRIPTION

IndexReader is the interface through which Searchers access the content of an InvIndex.

Point-in-time view of the invindex

IndexReader objects always represent a snapshot of an invindex as it existed at the moment the reader was created. If you want the search results to reflect modifications to an InvIndex, you must create a new IndexReader after the update process completes.

Caching a Searcher/Reader

When an IndexReader is created, a small portion of the InvIndex is loaded into memory; additional sort caches are filled as relevant queries arrive. For large document collections, the warmup time may become noticeable, in which case reusing the reader is likely to speed up your search application.

Caching an IndexReader (or a Searcher which contains an IndexReader) is especially helpful when running a high-activity app in a persistent environment, as under mod_perl or FastCGI.
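As a rough illustration of the caching idea, the sketch below keeps one reader per invindex path for the life of the process and replaces it on request. The names cached_reader, refresh_reader and %reader_for are hypothetical, and the opener is a stand-in code ref, so the example runs without KinoSearch installed.

```perl
use strict;
use warnings;

# Hypothetical per-process cache: one reader per invindex path.
my %reader_for;

# Return the cached reader, opening one on first use.
# $opener is any code ref that takes a path and returns a reader.
sub cached_reader {
    my ( $path, $opener ) = @_;
    $reader_for{$path} = $opener->($path)
        unless exists $reader_for{$path};
    return $reader_for{$path};
}

# Replace the cached reader, e.g. once an indexing run has finished,
# so that later searches see the updated invindex.
sub refresh_reader {
    my ( $path, $opener ) = @_;
    return $reader_for{$path} = $opener->($path);
}

# Demo with a stand-in opener that counts how often it runs.
my $opens  = 0;
my $opener = sub {
    $opens++;
    return { path => $_[0], generation => $opens };
};

my $first  = cached_reader( '/path/to/invindex', $opener );
my $second = cached_reader( '/path/to/invindex', $opener );    # reused
my $fresh  = refresh_reader( '/path/to/invindex', $opener );   # reopened
```

In a real application the opener would presumably wrap the KinoSearch::Index::IndexReader->open call shown in the SYNOPSIS. Because a reader is a point-in-time snapshot, some trigger for refresh_reader is needed whenever the invindex changes.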

Read-locking on shared volumes

When a file is no longer in use by an index, InvIndexer attempts to delete it as part of a cleanup routine triggered by the call to finish(). It is possible that at the moment an InvIndexer attempts to delete files that it no longer thinks are needed, a Searcher is in fact using them. This is particularly likely in a persistent environment, where Searchers/IndexReaders are cached and reused.

Ordinarily, this is not a problem.

On a typical Unix volume, the file will be deleted in name only: any process which holds an open filehandle against that file will continue to have access, and the file won't actually get vaporized until the last filehandle is cleared. Thanks to "delete on last close semantics", an InvIndexer can't truly delete the file out from underneath an active Searcher.
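The Unix behavior described above can be observed with a few lines of plain Perl. This is a minimal sketch that assumes a local Unix filesystem (not Windows, not NFS) and uses a temporary file in place of a real index file.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Create a small file standing in for an index file.
my ( $out, $path ) = tempfile();
print {$out} "segment data\n";
close $out or die "close failed: $!";

# A "reader" opens the file, then an "indexer" deletes it.
open my $in, '<', $path or die "open failed: $!";
unlink $path or die "unlink failed: $!";

# The name is gone from the directory...
my $still_listed = -e $path ? 1 : 0;

# ...but the open filehandle can still read the contents.
my $line = <$in>;
close $in or die "close failed: $!";
```

Only when the last handle is closed does the operating system reclaim the file's storage.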

On Windows, KinoSearch will attempt the file deletion, but an error will occur if any process holds an open handle. That's fine; InvIndexer runs these unlink() calls within an eval block, and if the attempt fails it will just try again the next time around.
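The "try again next time" pattern might look roughly like the following. This is an illustrative sketch, not KinoSearch's actual cleanup code: purge_obsolete is a hypothetical helper, and the demo uses a directory as a stand-in for a file that cannot be deleted, since Perl's unlink refuses to remove directories under normal circumstances.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile tempdir );

# Hypothetical helper: try to delete each obsolete file and return
# the ones that could not be removed, so the caller can retry later.
sub purge_obsolete {
    my @candidates = @_;
    my @retry_later;
    for my $path (@candidates) {
        next unless -e $path;    # already gone, nothing to do
        my $deleted = eval {
            unlink $path or die "Can't unlink '$path': $!";
            1;
        };
        # A failure is not fatal; remember the file for the next pass.
        push @retry_later, $path unless $deleted;
    }
    return @retry_later;
}

# Demo: one ordinary file, plus a directory that unlink will refuse.
my ( $fh, $deletable ) = tempfile();
close $fh or die "close failed: $!";
my $undeletable = tempdir( CLEANUP => 1 );

my @pending = purge_obsolete( $deletable, $undeletable );
```

After the call, @pending holds only the path that could not be removed, which would be passed back in on the next cleanup.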

On NFS, however, the system breaks, because NFS allows files to be deleted out from underneath an active process. Should this happen, the unlucky IndexReader will crash with a "Stale NFS filehandle" exception.

Under normal circumstances, it is neither necessary nor desirable for IndexReaders to secure read locks against an index, but for NFS we have to make an exception. KinoSearch::Store::LockFactory exists for this reason; supplying a LockFactory instance to IndexReader's constructor activates an internal locking mechanism and prevents concurrent indexing processes from deleting files that are needed by active readers.

LockFactory is implemented using lockfiles located in the index directory, so your reader applications must have write access. Stale lock files from crashed processes are ordinarily cleared away the next time the same machine -- as identified by the agent_id parameter supplied to LockFactory's constructor -- opens another IndexReader. (The classic technique of timing out lock files does not work because search processes may lie dormant indefinitely.) However, please be aware that if the last thing a given machine does is crash, lock files belonging to it may persist, preventing deletion of obsolete index data.
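To make the role of agent_id concrete, here is a toy model of stale-lock cleanup. Everything in it is an assumption made for illustration: the "<agent_id>-<pid>.lock" naming scheme, the clear_stale_locks helper and the liveness check are hypothetical and do not describe KinoSearch's real lock file format or internals.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Hypothetical scheme: lock files are named "<agent_id>-<pid>.lock".
# Remove locks owned by $agent_id whose process is no longer running.
sub clear_stale_locks {
    my ( $lock_dir, $agent_id, $is_alive ) = @_;

    # Default liveness test (simplistic): can we signal the process?
    $is_alive ||= sub { kill 0, $_[0] };

    opendir my $dh, $lock_dir or die "Can't opendir '$lock_dir': $!";
    my @names = readdir $dh;
    closedir $dh;

    my @cleared;
    for my $name ( sort @names ) {
        next unless $name =~ /\A(.+)-(\d+)\.lock\z/;
        my ( $owner, $pid ) = ( $1, $2 );

        # A process ID only means something on the machine that issued
        # it, so locks belonging to other agents are left alone.
        next unless $owner eq $agent_id;
        next if $is_alive->($pid);

        unlink "$lock_dir/$name" or die "Can't unlink '$name': $!";
        push @cleared, $name;
    }
    return @cleared;
}

# Demo: hostA has one dead and one live process; hostB has crashed.
my $dir = tempdir( CLEANUP => 1 );
for my $name ( 'hostA-111.lock', 'hostA-222.lock', 'hostB-333.lock' ) {
    open my $fh, '>', "$dir/$name" or die "Can't create '$name': $!";
    close $fh or die "close failed: $!";
}
my %running = ( 222 => 1 );    # pretend only PID 222 is still alive
my @cleared = clear_stale_locks( $dir, 'hostA', sub { $running{ $_[0] } } );
```

In this model hostA can clear only its own dead lock. The lock left by hostB stays in place until hostB itself opens another reader, which mirrors the caveat about a machine whose last act is to crash.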

FACTORY METHODS

open

    my $reader = KinoSearch::Index::IndexReader->open(
        invindex     => MySchema->read('/path/to/invindex'),
        lock_factory => $lock_factory,
    );

IndexReader is an abstract base class; open() functions like a constructor, but actually returns one of two possible subclasses: SegReader, which reads a single segment, and MultiReader, which channels the output of several SegReaders. Since each segment is a self-contained inverted index, a SegReader is in effect a complete index reader.

open() takes labeled parameters.
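The factory behavior can be pictured with a toy class hierarchy. The Toy:: classes and the segments parameter below are hypothetical, and choosing the subclass by segment count is an assumption about how such a factory might decide, not a description of KinoSearch's internals.

```perl
use strict;
use warnings;

package Toy::IndexReader;

# Factory method: the base class is never instantiated directly.
sub open {
    my ( $class, %args ) = @_;
    my @segments = @{ $args{segments} || [] };
    die "No segments to read" unless @segments;

    my @seg_readers
        = map { Toy::SegReader->new( segment => $_ ) } @segments;

    # One segment: a SegReader alone is a complete index reader.
    return $seg_readers[0] if @seg_readers == 1;

    # Several segments: combine the SegReaders behind one interface.
    return Toy::MultiReader->new( sub_readers => \@seg_readers );
}

# Shared constructor used by the subclasses.
sub new {
    my ( $class, %args ) = @_;
    return bless {%args}, $class;
}

package Toy::SegReader;
our @ISA = ('Toy::IndexReader');

package Toy::MultiReader;
our @ISA = ('Toy::IndexReader');

package main;

my $single = Toy::IndexReader->open( segments => ['seg_1'] );
my $multi  = Toy::IndexReader->open( segments => [ 'seg_1', 'seg_2' ] );
```

Either way, calling code can treat the return value as an IndexReader and need not care which subclass it received.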

METHODS

max_doc

    my $max_doc = $reader->max_doc;

Returns one greater than the maximum document number in the invindex.

num_docs

    my $docs_available = $reader->num_docs;

Returns the number of documents currently accessible. Equivalent to max_doc() minus deletions.
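The relationship between the two counts can be checked with a small worked example. This is a toy model in plain Perl rather than a call into KinoSearch, and it assumes document numbers start at 0, which is consistent with the two definitions above.

```perl
use strict;
use warnings;

# Toy model: ten documents were added, numbered 0 through 9,
# and three of them have since been marked as deleted.
my @doc_nums = ( 0 .. 9 );
my %deleted  = map { $_ => 1 } ( 2, 5, 7 );

# One greater than the highest document number.
my $max_doc = $doc_nums[-1] + 1;

# Documents still accessible: max_doc minus deletions.
my $num_docs = $max_doc - scalar keys %deleted;

# max_doc also serves as the exclusive upper bound when walking
# all document numbers and skipping the deleted ones.
my @live = grep { !$deleted{$_} } 0 .. $max_doc - 1;
```

Here $max_doc is 10 and $num_docs is 7, matching the seven entries in @live.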

COPYRIGHT

Copyright 2005-2007 Marvin Humphrey

LICENSE, DISCLAIMER, BUGS, etc.

See KinoSearch version 0.20.