18 Apr 2021 08:32:07 UTC
- Distribution: Lingua-StopWords
- Module version: 0.12
- License: perl_5
- Perl: v5.6.0
- Dependencies
- Encode
- Exporter
NAME
Lingua::StopWords - Stop words for several languages.
SYNOPSIS
    use Lingua::StopWords qw( getStopWords );

    my $stopwords = getStopWords('en');

    my @words = qw( i am the walrus goo goo g'joob );

    # prints "walrus goo goo g'joob"
    print join ' ', grep { !$stopwords->{$_} } @words;
DESCRIPTION
In keyword search, it is common practice to suppress a collection of "stopwords": words such as "the", "and", "maybe", etc. which occur in a large number of documents and tell you nothing important about any document that contains them. This module provides such "stoplists" in several languages.
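As a rough illustration of stopword suppression on free text, the sketch below lowercases a phrase, splits it into tokens, and drops the tokens found in the English stoplist. The `keywords` helper and the tokenizing regex are hypothetical and not part of this module; the sketch also assumes the stoplist entries are lowercase, as the SYNOPSIS example suggests.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords('en');

# Hypothetical helper (not provided by Lingua::StopWords): lowercase the
# text, split it into word-like tokens, and drop tokens in the stoplist.
sub keywords {
    my ($text) = @_;
    my @tokens = lc($text) =~ /[a-z']+/g;    # crude ASCII-only tokenizer
    return grep { !$stopwords->{$_} } @tokens;
}

my @keywords = keywords('The Walrus and the Carpenter');
print join( ' ', @keywords ), "\n";
```

Lowercasing first matters because the hash lookup is case-sensitive: "The" would not match a lowercase "the" entry.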
Supported Languages
    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          |
    | Dutch      | nl       | ISO-8859-1       | UTF-8          |
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          |
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          |
    | Hungarian  | hu       | ISO-8859-2       | UTF-8          |
    | Indonesian | id       | ISO-8859-1       | UTF-8          |
    | Italian    | it       | ISO-8859-1       | UTF-8          |
    | Norwegian  | no       | ISO-8859-1       | UTF-8          |
    | Portuguese | pt       | ISO-8859-1       | UTF-8          |
    | Romanian   | ro       | ISO-8859-2       | UTF-8          |
    | Spanish    | es       | ISO-8859-1       | UTF-8          |
    | Swedish    | sv       | ISO-8859-1       | UTF-8          |
    | Russian    | ru       | KOI8-R           | UTF-8          |
    |-----------------------------------------------------------|
FUNCTIONS
getStopWords
    my $stoplist      = getStopWords('en');
    my $utf8_stoplist = getStopWords('en', 'UTF-8');
Retrieve a stoplist in the form of a hashref where the keys are all stopwords and the values are all 1.
    $stoplist = {
        and => 1,
        if  => 1,
        # ...
    };
getStopWords() expects 1-2 arguments. The first, which is required, is an ISO code representing a supported language. If the ISO code cannot be found, getStopWords returns undef.
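Because an unsupported ISO code yields undef, callers may want to guard against dereferencing it. The following is a minimal sketch of one way to do that; the `stoplist_or_empty` helper is hypothetical and not part of this module, and 'xx' is used only as an example of a code that is absent from the table of supported languages.

```perl
use strict;
use warnings;
use Lingua::StopWords qw( getStopWords );

# Hypothetical helper (not provided by Lingua::StopWords): return the
# stoplist for $lang, or an empty hashref if the language is unsupported.
sub stoplist_or_empty {
    my ($lang) = @_;
    my $stoplist = getStopWords($lang);
    return defined $stoplist ? $stoplist : {};
}

my $english = stoplist_or_empty('en');
my $unknown = stoplist_or_empty('xx');    # not a supported language

printf "en: %d stopwords\n", scalar keys %$english;
printf "xx: %d stopwords\n", scalar keys %$unknown;
```

Falling back to an empty hashref means nothing is filtered for unknown languages; dying with an error message would be an equally reasonable choice, depending on the application.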
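The second argument should be 'UTF-8' if you want the stopwords encoded in UTF-8. The UTF-8 flag will be turned on for the returned strings, so Perl treats them as character strings rather than raw bytes; make sure you understand the implications of that, in particular that text you compare against them should be decoded first.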
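One practical consequence of the UTF-8 flag is that input arriving as raw bytes should be decoded before it is looked up in the stoplist. The sketch below illustrates this with the core Encode module; the sample German phrase is made up for illustration, and it assumes "für" is among the German stopwords.

```perl
use strict;
use warnings;
use Encode qw( decode encode );
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords( 'de', 'UTF-8' );

# Raw UTF-8 bytes, as they might arrive from a file or a socket.
my $bytes = "ein Lied f\xC3\xBCr dich";

# Decode to a character string before comparing against the stoplist keys.
my $text = decode( 'UTF-8', $bytes );
my @kept = grep { !$stopwords->{$_} } split ' ', $text;

# Encode again on the way out.
print encode( 'UTF-8', join ' ', @kept ), "\n";
```

Without the decode step, the bytes of "für" would not match the character-string key, and the word would pass through the filter.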
INSTALLATION
To install this module type the following:
    perl Build.PL
    ./Build
    ./Build test
    ./Build install
SEE ALSO
The stoplists supplied by this module were created as part of the Snowball project (see http://snowball.tartarus.org, Lingua::Stem::Snowball).
Lingua::EN::StopWords provides a different stoplist for English.
SOURCE REPOSITORY
https://github.com/wollmers/Lingua-StopWords
AUTHOR
Maintained by Helmut Wollmersdorfer <helmut@wollmersdorfer.at> and Marvin Humphrey <marvin at rectangular dot com>. Original author Fabien Potencier, <fabpot at cpan dot org>.
COPYRIGHT
Copyright 2021 Helmut Wollmersdorfer

Copyright 2004-2008 Fabien Potencier, Marvin Humphrey
LICENSE
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.3 or, at your option, any later version of Perl 5 you may have available.