BuzzSaw::DataSource::Files - A BuzzSaw data source for a set of files.


This documentation refers to BuzzSaw::DataSource::Files version 0.12.0


  use BuzzSaw::DataSource::Files;

  my $source = BuzzSaw::DataSource::Files->new(
                       parser      => "RFC3339",
                       names       => ["*.log"],
                       directories => ["/var/log"],
                       recursive   => 0 );


  while ( defined ( my $entry = $source->next_entry ) ) {
      # handle each $entry here
  }


This module provides a class which implements the BuzzSaw data source API. It can search a large directory tree to find all log files whose names match the specified patterns. The list of files is walked and each line of each file is returned as a continuous stream until the end is reached. Each line is treated as a separate log entry; no attempt is made to handle multi-line log entries.

This module can seamlessly handle a mixture of files compressed with gzip or bzip2 alongside standard plain-text files.

The module records each file once it has been completely parsed so that on the next run it can be ignored if it has not been altered.
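The idea of skipping files which have not changed can be sketched as follows. This is only an illustration under stated assumptions: the text above does not say how the module records a parsed file, so the SHA-256 digest, the in-memory %seen hash and the helper names (file_digest, needs_parsing, mark_parsed) are all hypothetical.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical record of fully parsed files: path => SHA-256 hex digest.
# (An assumption for illustration; a real tool would persist this.)
my %seen;

# Return the hex SHA-256 digest of a file's contents.
sub file_digest {
    my ($path) = @_;
    return Digest::SHA->new(256)->addfile($path)->hexdigest;
}

# True if the file was never recorded or has changed since it was recorded.
sub needs_parsing {
    my ($path) = @_;
    my $previous = $seen{$path};
    return 1 if !defined $previous;
    return $previous ne file_digest($path) ? 1 : 0;
}

# Record a file once every line in it has been processed.
sub mark_parsed {
    my ($path) = @_;
    $seen{$path} = file_digest($path);
    return;
}
```

A caller would check needs_parsing before opening a file and call mark_parsed only after reaching the end of that file.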

The module uses locking so that multiple processes can run concurrently on the same set of log files without any problems associated with conflicting access requirements.
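As a rough illustration of how cooperating processes can avoid working on the same file at once, the sketch below uses an advisory flock on a lock file. This is an assumption for illustration only: the locking mechanism actually used by the module is not described here, and try_lock is a hypothetical helper.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Try to take an exclusive, non-blocking advisory lock on a lock file.
# Returns the open handle on success (keep it in scope to hold the lock)
# or undef if the lock is already held elsewhere.
sub try_lock {
    my ($lockfile) = @_;
    open my $fh, '>>', $lockfile
        or die "Cannot open $lockfile: $!";
    return flock( $fh, LOCK_EX | LOCK_NB ) ? $fh : undef;
}
```

The lock is released when the returned handle is closed or goes out of scope, so a process that fails to get the lock can simply move on to the next file.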

The BuzzSaw project provides a suite of tools for processing log file entries. Entries in files are parsed and filtered into a set of events of interest which are stored in a database. A report generation framework is also available which makes it easy to generate regular reports regarding the events discovered.


The following attributes are inherited from the BuzzSaw::DataSource role which is implemented by this class. See the documentation for that role for full details.


The BuzzSaw::DB object.


The BuzzSaw::Parser object.


This is a boolean value which controls whether or not the contents of all files should be examined. If this is set to false (the default) then this class will not re-read files which have previously been examined and have not since been altered.

The following attributes are specific to this class.


If you wish to parse a set of specifically named log files then you can set the files attribute to a reference to an array of absolute filenames. More typically you will not know all the potential file names and will want to trawl an entire directory for all log files which match some pattern. In that case you should leave this attribute unset and use the names and directories attributes described below; this attribute will then be populated automatically by a filesystem search using the File::Find::Rule module. There is no default value. Note that you MUST specify either a set of names or a set of files when the files data source object is created.


The names attribute takes a list of strings and references to Perl regular expressions (as created with qr//). Strings are treated as simple shell-style glob patterns (e.g. *.log). If you need anything more complex then use the full power of Perl regular expressions (e.g. qr/^.*\.log(-\d+)?$/). A file will be included in the final list if ANY pattern matches; the rules are OR-ed, not AND-ed, together. You can specify as many patterns as you like and can mix both styles. There is no default value. Note that you MUST specify either a set of names or a set of files when the files data source object is created.
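The "any pattern matches" rule can be sketched in plain Perl as below. This is a simplified illustration, not the module's own code: glob_to_regex and name_matches are hypothetical helpers, and the glob conversion only understands * and ?.

```perl
use strict;
use warnings;

# Convert a simple shell-style pattern into an anchored regex.
# Only '*' and '?' are handled; this is a deliberate simplification.
sub glob_to_regex {
    my ($glob) = @_;
    my $re = quotemeta $glob;
    $re =~ s/\\\*/.*/g;    # '*' matches any run of characters
    $re =~ s/\\\?/./g;     # '?' matches exactly one character
    return qr/\A$re\z/;
}

# True if the file name matches ANY of the patterns, which may be
# plain strings (treated as globs) or qr// objects.
sub name_matches {
    my ( $name, @patterns ) = @_;
    for my $pattern (@patterns) {
        my $re = ref($pattern) eq 'Regexp'
            ? $pattern
            : glob_to_regex($pattern);
        return 1 if $name =~ $re;
    }
    return 0;
}
```

For example, name_matches( 'auth.log', '*.log', qr/^messages(-\d+)?$/ ) is true because the first pattern matches, even though the second does not.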


The directories attribute is a reference to a list of directories which should be searched when the names attribute has been specified and the files attribute has not. The default value is a single-element list containing . (i.e. the current directory).


The recursive attribute is a boolean value which controls whether or not to search recursively through the specified directories looking for matching log files. The default is true.


The order_by attribute is a string which controls the sequence in which the list of files found will be parsed. The supported options are: random, name_asc, name_desc, size_asc and size_desc. The default is random.

Typically you will want to randomise the order of the list so that multiple processes will pass through the files in different orders which should make the process more efficient. The size sorting can be very useful if you really do need to leave the biggest files until last (or get them done first).
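The five orderings could be implemented along the following lines. This is a minimal sketch assuming the option names listed above; order_files is a hypothetical helper and is not part of the module's API.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return the given files arranged according to the named ordering.
sub order_files {
    my ( $order_by, @files ) = @_;

    # Look up each size once; -s is false for empty or missing files.
    my %size;
    $size{$_} = ( -s $_ ) || 0 for @files;

    my %sorters = (
        random    => sub { return shuffle @files },
        name_asc  => sub { return sort { $a cmp $b } @files },
        name_desc => sub { return sort { $b cmp $a } @files },
        size_asc  => sub { return sort { $size{$a} <=> $size{$b} } @files },
        size_desc => sub { return sort { $size{$b} <=> $size{$a} } @files },
    );

    my $sorter = $sorters{$order_by}
        or die "Unsupported order_by value: $order_by\n";
    return $sorter->();
}
```

Caching the sizes up front avoids repeated filesystem calls inside the sort comparison.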


This is a string which is used to set limits on the size (in bytes) of the files which will be parsed. The format follows the semantics supported by the Number::Compare module, for example "<100M" or "<200K". If this is not set or is set to 0 (zero) all files will be parsed.


The class provides implementations of the two methods required by the BuzzSaw::DataSource role.


$source->reset

This resets all internal iterators which are used to track the current location in the currently open file. It also forces a rescan of the file system if the names attribute has been specified. You probably want to call this just before you start working through the list of entries.

$entry = $source->next_entry

This method works through the set of files as a single continuous stream. Whenever the end-of-file is reached in one file the next is opened, until the complete set of data is exhausted. When the end of the stream is reached an undef value is returned. If an error occurs this method will die. The sequence in which files are parsed is controlled by the order_by attribute.
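The "single continuous stream" behaviour can be illustrated with a small closure-based iterator. This is a sketch of the general technique rather than the module's implementation; make_line_stream is a hypothetical name and only plain-text files are handled here.

```perl
use strict;
use warnings;

# Build an iterator which returns one line per call across all the
# files, and undef once every file has been exhausted.
sub make_line_stream {
    my @files = @_;
    my $fh;
    return sub {
        while (1) {
            if ( !defined $fh ) {
                my $file = shift @files;
                return undef if !defined $file;    # end of the stream
                open $fh, '<', $file
                    or die "Cannot open $file: $!";
            }
            my $line = <$fh>;
            if ( defined $line ) {
                chomp $line;
                return $line;
            }
            close $fh;
            undef $fh;    # end of this file, move on to the next
        }
    };
}
```

The iterator is consumed in the same way as next_entry, with a loop of the form while ( defined( my $line = $next->() ) ), so that a line containing only "0" does not end the loop early.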

Note that this module can handle files which are compressed using gzip (if the file name suffix is .gz) or bzip2 (if the file name suffix is .bz2).
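Choosing a reader from the file name suffix might look like the sketch below. It uses the core IO::Uncompress::Gunzip and IO::Uncompress::Bunzip2 modules as an assumption; the text above does not say which libraries the module itself uses, and open_log is a hypothetical helper.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip  qw($GunzipError);
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);

# Return a readable handle for a log file, picking a decompressor
# from the file name suffix (.gz, .bz2) or falling back to plain text.
sub open_log {
    my ($path) = @_;

    if ( $path =~ /\.gz\z/ ) {
        my $fh = IO::Uncompress::Gunzip->new($path)
            or die "Cannot gunzip $path: $GunzipError";
        return $fh;
    }
    if ( $path =~ /\.bz2\z/ ) {
        my $fh = IO::Uncompress::Bunzip2->new($path)
            or die "Cannot bunzip2 $path: $Bunzip2Error";
        return $fh;
    }

    open my $fh, '<', $path
        or die "Cannot open $path: $!";
    return $fh;
}
```

All three handles can be read line by line with the usual <$fh> operator, so the calling code does not need to know whether the file was compressed.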


This module is powered by Moose. It also requires MooseX::Types, MooseX::Log::Log4perl and MooseX::SimpleConfig.

It also needs File::Find::Rule.


BuzzSaw, BuzzSaw::DataSource, DataSource::Importer


This is the list of platforms on which we have tested this software. We expect this software to work on any Unix-like platform which is supported by Perl.



Please report any bugs or problems (or praise!) to the author. Feedback and patches are also always very welcome.


    Stephen Quinney <>


    Copyright (C) 2012 University of Edinburgh. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the terms of the GPL, version 2 or later.
