=head1 NAME

KinoSearch::Docs::FileFormat - Overview of invindex file format.


It is not necessary to understand the current implementation details of
KinoSearch's file format in order to use the library effectively, but it may
be helpful if you are interested in tweaking for high performance, exotic
usage, or debugging and development.  

On a file system, all the files in an invindex exist in one, flat directory.
Conceptually, the files have a hierarchical relationship: an invindex is made
up of "segments", each of which is an independent inverted index, and each
segment is made up of several subsections.

                                  |--[seg _0]--|
                                  |            |--[lexicon]
                                  |            |--[postings]
                                  |            |--[documents]
                                  |            |--[term vectors]
                                  |            |--[deletions]
                                  |--[seg _1]--|
                                  |            |--[lexicon]
                                  |            |--[postings]
                                  |            |--[documents]
                                  |            |--[term vectors]
                                  |            |--[deletions]
                                  |--[ ... ]---| 

=head1 Write-once philosophy

All the files which belong to a given segment share a common prefix consisting
of an underscore followed by a number in base 36: _0, _5m, _p9s2.  Once a
segment is finished and sent "live", its name is never re-used.

Segments are written once, and with the exception of the list of deleted
document numbers, are never modified once written.  Old segments become
obsolete and can be removed when their data has been consolidated into new
segments during the process of segment-merging and optimization.  A
fully-optimized index has only one segment.

The C<segments_XXX.yaml> file also utilizes the unique-base-36-number naming
convention -- "XXX" is a placeholder, so real files would be named e.g.
C<segments_3r.yaml>.  The higher the number, the more recent the file.  

There is only one file whose name gets re-used: C<write.lock>.  

Because all index file names other than C<write.lock> are unique, there is no
danger of overwriting an active file, as well as a greatly diminished need for
file locking in general.

=head1 Top-level files

There are two "top-level" files which belong to the entire invindex rather
than to a particular segment.  Their contents are stored in human-readable
L<YAML> to facilitate spelunking and debugging.  

=head2 segments_XXX.yaml

The C<segments_XXX.yaml> file stores information about which segments are
active and what they contain.  It is updated every time the invindex changes.

When a new segment is being written, KinoSearch may put files into the
invindex directory, but until a new segments_XXX.yaml file gets written, a
Searcher reading the index won't know about them.

=head2 write.lock

Only one indexing process may safely modify the invindex at any given time.
Processes reserve an invindex by laying claim to the C<write.lock> file.

=head1 A segment's component parts

Each segment can be said to have five logical parts: lexicon, postings,
document storage, term vectors, and deletions.

All segment files use non-human-readable binary formats, but they all adhere
to the same simple pattern: records stacked end-to-end-to-end.  Given
unlimited memory, any segment file may be read into an array by running a
single deserialization function against it over and over until the file ends.

=head2 Lexicon 

A segment's Lexicon is spread out over several files: one primary file and
one auxiliary file per indexed field.  


=item *

B<_XXX.lexYYY> - B<LEX>icon.  Enumerates all the indexed terms for a given
field in the segment.  Terms are sorted by lexical comparison of term text.

=item *

B<_XXX.lexxYYY> - B<LEX>icon IndeB<X>.  Houses periodic samples from the
main lexicon -- by default, every 128th term.  Each entry is augmented by a
file pointer into the main lexicon.


The files follow the naming pattern C<_XXX.lexYYY>, where "_XXX" is the
segment, and "YYY" is the field number.  So, if you were indexing plain text
files and you had two fields, only one of which was indexed...

    package MySchema::UnIndexed;
    use base qw( KinoSearch::Schema::FieldSpec );
    sub indexed  {0}
    sub analyzed {0}
    package MySchema;
    use base qw( KinoSearch::Schema );
    our %fields  = (
        content  => 'KinoSearch::Schema::FieldSpec',
        filepath => 'MySchema::UnIndexed',

... there would only be 1 pair of lexicon files per segment, "_XXX.lex0" and
"_XXX.lexx0", containing in this example the terms for the "content" field.

When a Searcher is launched, each .lexx file is decoded and cached as an
in-memory array of Terms.  To locate a term in the main lexicon, first a
binary search is performed against the lexicon-index array.  The results of
this search point to a small slice of the main on-disk lexicon file, which
is then scanned linearly.

=head2 Postings

L<Posting|KinoSearch::Posting> is a technical term from the field of
L<information retrieval|KinoSearch::Docs::IRTheory>, defined as a single
instance of a one term indexing one document.  If you are looking at the index
in the back of a book, and you see that "freedom" is referenced on pages 8,
86, and 240, that would be three postings, which taken together form a
"posting list".  The same terminology applies to an index in electronic form.

Each segment has one postings file per indexed field.  The file names follow
the same pattern as the lexicon files, but with a "p": C<_XXX.pYYY>.

When a search is performed for a single term, first that term is looked up in
the lexicon.  If the term exists in a segment, the record in the lexicon will
contain information about which postings file to look at and where to look.

The first thing any posting record tells you is a document number.  By
iterating over all the postings associated with a term, you can find all the
documents that match that term, a process which is analogous to looking up
page numbers in a book's index.  However, each posting record typically
contains other information in addition to document number, depending on what
subclass of Posting the field is spec'd to use: the number of times the term
appears within the field or "freq", the positions at which the term occurs
within the field, etc.

=head2 Documents

The document storage section is essentially an object-oriented database where
the "original" documents added to an invindex are housed.  It is organized into
two files:


=item * 

B<_XXX.ds> - B<D>ocument B<S>torage.  Serialized documents.

=item *

B<_XXX.dsx> - B<D>ocument B<S>torage IndeB<X>.  A solid array of 64-bit
integers.  Each integer location corresponds to a document number, and the
value at that location points at a file position in the _XXX.ds file.


When a document number turns up as a hit in a search and its associated
document must be retrieved, KinoSearch looks up the doc number in the _XXX.dsx
file to find the file pointer, then seeks to that location in the _XXX.ds file
and deserializes the document.

Not all information associated with the original document can be recovered at
search-time.  For instance, if a field is spec'd as C<indexed> but not
C<stored>, its terms will appear in the lexicon, but its original content
will I<not> be found along with the rest of the document in the relevant
_XXX.ds file.

=head2 Term Vectors

The files which store Term Vectors, used for excerpting and highlighting, are
organized similarly to the files used to store documents.


=item * 

B<_XXX.tv> - B<T>erm B<V>ectors.  1 array of terms per document.

=item *

B<_XXX.tvx> - B<T>erm B<V>ector IndeB<X>.  As with the C<_XXX.dsx> file, a 
solid array of 64-bit file pointers.


=head2 Deletions

When a document is "deleted" from a segment, it is not actually purged right
away; it is merely marked as "deleted", via the .del file.  The .del file
contains a bit vector with one bit for each document in the segment; if bit
#254 is set then document 254 is deleted, and if that document turns up in a
search it will be masked out.

It is only when a segment's contents are rewritten to a new segment during the
segment-merging process that deleted documents truly go away.

Deletions may be performed against a given segment several times during its
lifespan, thus .del filenames use I<two> base-36 numbers: C<_XXX_YYY.del>.
The first number corresponds to segment name, and the second to the file's
"generation", so the file "_aa_1.del" would supersede "_aa_0.del" as a list of
the deletions for segment "_aa".

=head1 Compound File

If you peer inside an index directory, you won't actually find any files with
the extensions C<lexYYY>, C<lexxYYY>, C<pYYY>, C<ds>, or C<dsx>, unless there
is a live indexing process underway.  What you will find is one file per
segment named "_XXX.cf". (cf => B<C>ompound B<F>ile)

To minimize the need for file descriptors at search-time, KinoSearch
concatenates all per-segment files onto each other at the close of each
indexing session.  Information about where each file begins and ends is stored
in C<segments_XXX.yaml>.  When the segment is opened for reading, a single
file descriptor per compound file gets shared by multiple InStream objects.

=head1 Document Numbers

Document numbers in KinoSearch are ephemeral.   They change every time a
document gets moved from one segment to a new one during segment merging and

If you need to assign a primary key to each document, you need to create a
field and populate it with an externally generated unique identifier.

=head1 A Typical Search

Here's a simplified narrative, dramatizing how a search for "freedom" against
a given segment plays out:


=item 1

The searcher asks the relevant Lexicon Index file, "Do you know anything about
'freedom'?"  Lexicon Index replies, "Can't say for sure, but if the main
Lexicon file does, 'freedom' is probably somewhere around byte 21008".  

=item 2

The main Lexicon tells the searcher "One moment, let me scan our records...
Yes, we have 2 documents which contain 'freedom'.  You'll find them in the
_6.p2 Postings file starting at byte 66991."

=item 3

The Postings file says "Yep, we have 'freedom', all right!  Document number 40
has 1 'freedom', and document 44 has 8.  If you need to know more, like if any
'freedom' is part of the phrase 'freedom of speech', ask me about positions!

=item 4

If the searcher is only looking for 'freedom' in isolation, that's where it
stops.  It now knows enough to assign the documents scores against "freedom",
with the 8-freedom document likely ranking higher than the single-freedom



Copyright 2005-2007 Marvin Humphrey


See L<KinoSearch> version 0.20.