19 Aug 2006 06:34:01 UTC
- Distribution: KinoSearch
- Source (raw)
- Browse (raw)
- How to Contribute
- Issues (5)
- Testers (1 / 0 / 0)
- KwaliteeBus factor: 0
- License: perl_5
- Activity24 month
- Download (212.29KB)
- MetaCPAN Explorer
- Subscribe to distribution
- This version
- Latest versionCREAMYG Marvin Humphreyand 1 contributors
- Marvin Humphrey <marvin at rectangular dot com>
KinoSearch::Analysis::Token - unit of text
# private class - no public API
You can't actually instantiate a Token object at the Perl level -- however, you can affect individual Tokens within a TokenBatch by way of TokenBatch's (experimental) API.
Token is the fundamental unit used by KinoSearch's Analyzer subclasses. Each Token has 4 attributes: text, start_offset, end_offset, and pos_inc (for position increment).
The text of a token is a string.
A Token's start_offset and end_offset locate it within a larger text, even if the Token's text attribute gets modified -- by stemming, for instance. The Token for "beating" in the text "beating a dead horse" begins life with a start_offset of 0 and an end_offset of 7; after stemming, the text is "beat", but the end_offset is still 7.
The position increment, which defaults to 1, is a an advanced tool for manipulating phrase matching. Ordinarily, Tokens are assigned consecutive position numbers: 0, 1, and 2 for "three blind mice". However, if you set the position increment for "blind" to, say, 1000, then the three tokens will end up assigned to positions 0, 1, and 1001 -- and will no longer produce a phrase match for the query '"three blind mice"'.
Copyright 2006 Marvin Humphrey
See KinoSearch version 0.13.