Nathan Gary Glenn and 3 contributors


Algorithm::AM::DataSet - Manage data used by Algorithm::AM


version 3.12


 use Algorithm::AM::DataSet 'dataset_from_file';
 use Algorithm::AM::DataSet::Item 'new_item';
 # create an empty data set whose items each have 10 features
 my $dataset = Algorithm::AM::DataSet->new(cardinality => 10);
 # or read one from a file
 $dataset = dataset_from_file(path => 'finnverb', format => 'nocommas');
 # add an item to the data set
 $dataset->add_item(
   new_item(features => [qw(a b c d e f g h i)]));
 my $item = $dataset->get_item(2);


This package contains a list of items that can be used by Algorithm::AM or Algorithm::AM::Batch for classification. DataSets can be made one item at a time via the "add_item" method, or they can be read from files via the "dataset_from_file" function.


The "new" constructor creates a new DataSet object. You must provide a 'cardinality' argument giving the number of features in each item's feature vector. You can then add items via the "add_item" method. Each item contains a feature vector and, optionally, a class label and a comment (also called a "spec").


The "cardinality" method returns the number of features contained in the feature vector of a single item.


The "size" method returns the number of items in the data set.


The "classes" method returns the list of all unique class labels in the data set.


The "add_item" method adds a new item to the data set. The input may be either an Algorithm::AM::DataSet::Item object or the arguments needed to create one via its constructor (features, class, comment). This method will croak if the number of features in the item does not match the data set's "cardinality".
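The two ways of adding an item, and the cardinality check, can be sketched as follows. This is a minimal illustration rather than code from the distribution; the feature values, class labels, and comment are made up, and the croak is trapped with eval so the script keeps running.

```perl
use strict;
use warnings;
use Algorithm::AM::DataSet;
use Algorithm::AM::DataSet::Item 'new_item';

# every item in this data set must have exactly 3 features
my $dataset = Algorithm::AM::DataSet->new(cardinality => 3);

# pass the Item constructor arguments directly...
$dataset->add_item(
    features => [qw(a b c)],
    class    => 'x',
    comment  => 'first item',
);

# ...or pass a ready-made Item object
$dataset->add_item(new_item(features => [qw(d e f)], class => 'y'));

# an item with only 2 features does not match the cardinality,
# so add_item croaks; eval traps the exception
my $ok = eval {
    $dataset->add_item(features => [qw(a b)], class => 'x');
    1;
};
warn "item rejected: $@" unless $ok;
```

Building the Item first with new_item is convenient when the same item is also needed elsewhere, for example as a test item for classification.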


The "get_item" method returns the item at the given index. This will be an Algorithm::AM::DataSet::Item object.


The "num_classes" method returns the number of different classification labels contained in the data set.
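The query methods described above might be used together as in the sketch below. The names size, classes, and num_classes, the Item accessors features, class, and comment, and the zero-based index passed to get_item are assumptions about the module's interface; check them against your installed version. The data is invented for illustration.

```perl
use strict;
use warnings;
use Algorithm::AM::DataSet;

my $dataset = Algorithm::AM::DataSet->new(cardinality => 3);
$dataset->add_item(features => [qw(a b c)], class => 'x', comment => 'one');
$dataset->add_item(features => [qw(a b d)], class => 'y', comment => 'two');
$dataset->add_item(features => [qw(a c d)], class => 'x', comment => 'three');

printf "features per item: %d\n", $dataset->cardinality;
printf "items: %d\n",             $dataset->size;          # assumed method name

# unique class labels; sorted here because no ordering is assumed
my @classes = $dataset->classes;                            # assumed method name
printf "classes (%d): %s\n",
    $dataset->num_classes,                                  # assumed method name
    join ', ', sort @classes;

# indexing is assumed to start at 0
my $item = $dataset->get_item(0);
printf "%s => %s (%s)\n",
    join(' ', @{ $item->features }),   # assumed accessor returning an array ref
    $item->class,                      # assumed accessor
    $item->comment;                    # assumed accessor
```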


The "dataset_from_file" function may be exported. Given 'path' and 'format' arguments, it reads a file containing a data set and returns a new DataSet object holding that data. The 'path' argument is the path to the file. The 'format' argument must be 'commas' or 'nocommas', selecting one of the formats described below. You may also pass 'unknown' and 'null' arguments giving the strings that represent an unknown class value and a null feature value; by default these are 'UNK' and '='.

The 'commas' file format is shown below:

 class , f eat u re s , your comment here

The commas separate the class label, the feature values, and the comment; whitespace around the commas is optional. Feature values are separated from one another by whitespace, so the features here are f, eat, u, re, and s.

The 'nocommas' file format is shown below:

 class   features  your comment here

Here the class, feature values, and comments are separated by whitespace. Each feature value must be a single character with no separating characters, so here the features are f, e, a, t, u, r, e, and s.
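To make the difference between the two formats concrete, here is a small stand-alone sketch that splits one line of each kind. It is not the module's own parser; the helper names parse_commas_line and parse_nocommas_line are hypothetical, and the sketch ignores the 'unknown' and 'null' strings.

```perl
use strict;
use warnings;

# 'commas' format: fields are split on commas, features on whitespace
sub parse_commas_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    my ($class, $features, $comment) = split /\s*,\s*/, $line, 3;
    return ($class, [ split ' ', $features ], $comment);
}

# 'nocommas' format: fields are split on whitespace,
# and every character of the second field is one feature
sub parse_nocommas_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    my ($class, $features, $comment) = split ' ', $line, 3;
    return ($class, [ split //, $features ], $comment);
}

my @commas   = parse_commas_line(' class , f eat u re s , your comment here');
my @nocommas = parse_nocommas_line(' class   features  your comment here');

print join('|', @{ $commas[1] }),   "\n";   # f|eat|u|re|s
print join('|', @{ $nocommas[1] }), "\n";   # f|e|a|t|u|r|e|s
```

The same text therefore yields five features in one format and eight in the other, which is why the 'format' argument has to be given explicitly.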

Lines beginning with a pound character (#) are ignored.
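Putting the pieces together, the following sketch writes a tiny 'commas' file to a temporary location and loads it. The data, the temporary file, and the choice of '?' and '-' for the 'unknown' and 'null' strings are all invented for illustration; the size method name is an assumption about the module's interface.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';
use Algorithm::AM::DataSet 'dataset_from_file';

# write a small data file; the first line is a comment and is skipped
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} <<'END_DATA';
# toy data: class , features , comment
past , w a l k , walked
pres , t a l k , talks
? , b a - k , class unknown and third feature null
END_DATA
close $fh or die "cannot close $path: $!";

my $dataset = dataset_from_file(
    path    => $path,
    format  => 'commas',
    unknown => '?',    # string marking an unknown class (default 'UNK')
    null    => '-',    # string marking a null feature (default '=')
);

printf "%d items with %d features each\n",
    $dataset->size,    # assumed method name
    $dataset->cardinality;
```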


For information on creating data sets, see the appendices in the "red book", Analogical Modeling: An exemplar-based approach to language. See also the "green book", Analogical Modeling of Language, for an explanation of the method in general, and the "blue book", Analogy and Structure, for its mathematical basis.


Theron Stanford, Nathan Glenn


This software is copyright (c) 2021 by Royal Skousen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.