11 Jan 2004 13:13:35 UTC
- Distribution: Lingua-ZH-Toke
- Module version: 0.02
NAME
Lingua::ZH::Toke - Chinese Tokenizer
VERSION
This document describes version 0.02 of Lingua::ZH::Toke, released January 11, 2004.
SYNOPSIS
    use Lingua::ZH::Toke;

    # -- if inputs are unicode strings, use the two lines below instead
    # use utf8;
    # use Lingua::ZH::Toke 'utf8';

    # Create Lingua::ZH::Toke::Sentence object (->Sentence also works)
    my $token = Lingua::ZH::Toke->new( '那人卻在/燈火闌珊處/益發意興闌珊' );

    # Easy tokenization via array dereferencing
    print $token->[0]   # Fragment       - 那人卻在
                ->[2]   # Phrase         - 卻在
                ->[0]   # Character      - 卻
                ->[0]   # Pronounciation - ㄑㄩㄝˋ
                ->[2];  # Phonetic       - ㄝ

    # Magic histogram via hash dereferencing
    print $token->{'那人卻在'};  # 1 - One such fragment there
    print $token->{'意興闌珊'};  # 1 - One such phrase there
    print $token->{'發意興闌'};  # undef - That's not a phrase
    print $token->{'珊'};        # 2 - Two such characters there
    print $token->{'ㄧˋ'};       # 2 - Two such pronunciations: 益意
    print $token->{'ㄨ'};        # 3 - Three such phonetics: 那火處

    # Iteration over fragments
    while (my $fragment = <$token>) {
        # Iteration over phrases
        while (my $phrase = <$fragment>) {
            # ...
        }
    }
DESCRIPTION
This module puts a thin wrapper around Lingua::ZH::TaBE, by blessing references to TaBE's objects into their English counterparts.
Besides offering more readable class names, this module also offers various overloaded methods for tokenization; please see "SYNOPSIS" for the three major ones.
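The three overloaded behaviours shown in "SYNOPSIS" (array dereferencing, hash dereferencing and iteration) rest on Perl's core overload pragma. The following is only a minimal sketch of that mechanism, not this module's actual implementation: the class Sketch::Tokens and its helper subs are hypothetical, and it splits on whitespace instead of calling TaBE.

```perl
use strict;
use warnings;

{
    # Hypothetical stand-in class; not part of Lingua::ZH::Toke or TaBE.
    package Sketch::Tokens;

    use overload
        '@{}'    => \&_as_array,   # $obj->[N]
        '%{}'    => \&_as_hash,    # $obj->{TOKEN}
        '<>'     => \&_next,       # <$obj>
        fallback => 1;

    # The object is a blessed scalar reference holding a plain hash reference,
    # so the overloaded @{} and %{} never recurse into the object's own state.
    sub new {
        my ($class, $text) = @_;
        my $state = { tokens => [ split ' ', $text ], pos => 0 };
        return bless \$state, $class;
    }

    # Array dereferencing exposes the token list.
    sub _as_array {
        my ($self) = @_;
        return $$self->{tokens};
    }

    # Hash dereferencing builds a histogram of token counts.
    sub _as_hash {
        my ($self) = @_;
        my %count;
        $count{$_}++ for @{ $$self->{tokens} };
        return \%count;
    }

    # The <> operator returns one token per call, then undef at the end.
    sub _next {
        my ($self) = @_;
        my $state = $$self;
        if ( $state->{pos} >= @{ $state->{tokens} } ) {
            $state->{pos} = 0;    # rewind so the object can be iterated again
            return undef;
        }
        return $state->{tokens}[ $state->{pos}++ ];
    }
}

my $t = Sketch::Tokens->new('a b a c');
print $t->[1], "\n";     # prints "b"
print $t->{a}, "\n";     # prints "2"
while ( my $tok = <$t> ) {
    print "$tok\n";      # prints a, b, a, c on separate lines
}
```

Storing the state behind a scalar reference is one common way to overload both @{} and %{} on the same object; other designs (for example, temporarily disabling overloading) would work as well.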
Since Lingua::ZH::TaBE is a Big5-oriented module, we also provide a simple utf8 layer around it; if you have Perl version 5.6.1 or later, just use this:
    use utf8;
    use Lingua::ZH::Toke 'utf8';
With the utf8 flag set, all Toke objects will stringify to unicode strings, and constructors will take either unicode strings or big5-encoded bytestrings. Note that on Perl 5.6.x, Encode::compat is needed for the utf8 feature to work.
METHODS
The constructor methods correspond to the six object levels: ->Sentence, ->Fragment, ->Phrase, ->Character, ->Pronounciation and ->Phonetic. Each of them takes one string argument, representing the string to be tokenized.
The ->new method is an alias to ->Sentence.
All object methods, except ->new, are passed to the underlying Lingua::ZH::TaBE object.
CAVEATS
Under utf8 mode, you may sometimes need to explicitly stringify the return values, so their utf8 flag can be properly set:
    $value = $token->[0];      # this may or may not work
    $value = "$token->[0]";    # this is guaranteed to work
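To illustrate why explicit stringification matters, here is a small sketch that assumes a modern Perl with the core Encode module. The class Sketch::Big5Text is hypothetical and is not how this module is implemented; it only shows that a plain assignment copies the object, while interpolating it in double quotes triggers the overloaded "" operator and yields a decoded character string.

```perl
use strict;
use warnings;
use Encode ();

{
    # Hypothetical wrapper around Big5 bytes; not part of Lingua::ZH::Toke.
    package Sketch::Big5Text;

    use overload
        '""' => sub {
            my $bytes = ${ $_[0] };                  # copy, so decode() cannot alter the object
            return Encode::decode( 'big5', $bytes ); # Big5 bytes -> character string
        },
        fallback => 1;

    sub new {
        my ( $class, $bytes ) = @_;
        return bless \$bytes, $class;
    }
}

# Two characters (U+90A3, U+4EBA) encoded as Big5 bytes.
my $bytes = Encode::encode( 'big5', "\x{90A3}\x{4EBA}" );
my $obj   = Sketch::Big5Text->new($bytes);

my $value  = $obj;      # still an object; nothing has been decoded yet
my $string = "$obj";    # stringification runs the overloaded "" handler

print ref($value), "\n";                                       # Sketch::Big5Text
print length($string), "\n";                                   # 2 (characters, not bytes)
print utf8::is_utf8($string) ? "flagged\n" : "not flagged\n";  # flagged
```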
This module does not care about efficiency or memory consumption yet, hence it's likely to fail miserably if you demand either of them. Patches welcome.
As the name suggests, the chosen interface is very bizarre. Use it at the risk of your own sanity.
SEE ALSO
Lingua::ZH::TaBE, Encode::compat, Encode
AUTHORS
Autrijus Tang <autrijus@autrijus.org>
COPYRIGHT
Copyright 2003, 2004 by Autrijus Tang <autrijus@autrijus.org>.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.