App::CSVUtils - CLI utilities related to CSV
This document describes version 1.031 of App::CSVUtils (from Perl distribution App-CSVUtils), released on 2023-08-06.
This distribution contains the following CLI utilities:
Usage:
gen_csv_util(%args) -> bool
Generate a CSV utility.
This routine is used to generate a CSV utility in the form of a Rinci function (code and metadata). You can then produce a CLI from the Rinci function simply using Perinci::CmdLine::Gen or, if you use Dist::Zilla, Dist::Zilla::Plugin::GenPericmdScript or, if on the command-line, gen-pericmd-script.
Using this routine, by providing just one or a few hooks and setting some parameters like a couple of extra arguments, you will get a complete CLI with decent POD/manpage, ability to read one or multiple CSV's and write one or multiple CSV's, some command-line options to customize how the input CSV's should be parsed and how the output CSV's should be formatted and named. Your CLI also has tab completion, usage and help message, and other features.
To create a CSV utility, you specify a name (e.g. csv_dump; must be a valid unqualified Perl identifier/function name) and optionally summary, description, and other metadata like links or even add_meta_props. Then you specify one or more of on_* or before_* or after_* arguments to supply handlers (coderefs) for your CSV utility at various hook points.
name
csv_dump
summary
description
links
add_meta_props
on_*
before_*
after_*
THE HOOKS
All code for hooks should accept a single argument r. r is a stash (hashref) of various data, the keys of which will depend on which hook point being called. You can also add more keys to store data or for flow control (see hook documentation below for more details).
r
The order of the hooks, in processing chronological order:
on_begin
Called when utility begins, before reading CSV. You can use this hook e.g. to process arguments, set output filenames (if you allow custom output filenames).
before_read_input
Called before opening any input CSV file. This hook is still called even if your utility sets reads_csv to false.
reads_csv
At this point, the input_filenames stash key (as well as other keys like input_filename, input_filenum, etc) has not been set. You can use this hook e.g. to set a custom input_filenames.
input_filenames
input_filename
input_filenum
before_open_input_files
Called before an input CSV file is about to be opened, including for stdin (-). You can use this hook e.g. to check/preprocess input file. Flow control is available by setting $r->{wants_skip_files} to skip reading all the input files and go directly to the after_read_input hook.
-
$r->{wants_skip_files}
after_read_input
before_open_input_file
Called before each input CSV file is about to be opened, including for stdin (-). For the first file, called after before_open_input_file hook. You can use this hook e.g. to check/preprocess input file. Flow control is available by setting $r->{wants_skip_file} to skip reading a single input file and go to the next file, or $r->{wants_skip_files} to skip reading the rest of the files and go directly to the after_read_input hook.
$r->{wants_skip_file}
on_input_header_row
Called when receiving header row. Will be called for every input file, and called even when user specify --no-input-header, in which case the header row will be the generated ["field1", "field2", ...]. You can use this hook e.g. to add/remove/rearrange fields.
--no-input-header
["field1", "field2", ...]
You can set $r->{wants_fill_rows} to a defined false if you do not want $r->{input_rows} to be filled with empty string elements when it contains less than the number of fields (in case of sparse values at the end). Normally you only want to do this when you want to do checking, e.g. in csv-check-rows.
$r->{wants_fill_rows}
$r->{input_rows}
on_input_data_row
Called when receiving each data row. You can use this hook e.g. to modify the row or print output (for line-by-line transformation or filtering).
after_close_input_file
Called after each input file is closed, including for stdin (-) (although for stdin, the handle is not actually closed). Flow control is possible by setting $r->{wants_skip_files} to skip reading the rest of the files and go straight to the after_close_input_files hook.
after_close_input_files
Called after the last input file is closed, after the last after_close_input_file hook, including for stdin (-) (although for stdin, the handle is not actually closed).
Called after the last row of the last CSV file is read and the last file is closed. This hook is still called, if you set reads_csv option to false. At this point the stash keys related to CSV reading have all been cleared, including input_filenames, input_filename, input_fh, etc.
input_fh
You can use this hook e.g. to print output if you buffer the output.
on_end
Called when utility is about to exit. You can use this hook e.g. to return the final result.
THE STASH
The common keys that r will contain:
gen_args, hash. The arguments used to generate the CSV utility.
gen_args
util_args, hash. The arguments that your CSV utility accepts. Parsed from command-line arguments (or configuration files, or environment variables).
util_args
name, str. The name of the CSV utility. Which can also be retrieved via gen_args.
code_print, coderef. Routine provided for you to print something. Accepts a string. Takes care of opening the output files for you.
code_print
code_print_row, coderef. Routine provided for you to print a data row. You pass the row (either arrayref or hashref). Takes care of opening the output files for you, as well as printing header row the first time, if needed.
code_print_row
code_print_header_row, coderef. Routine provided for you to print header row. You don't need to pass any arguments. Will only print the header row once per output file if output header is enabled, even if called multiple times.
code_print_header_row
If you are accepting CSV data (reads_csv gen argument set to true), the following keys will also be available (in on_input_header_row and on_input_data_row hooks):
input_parser, a Text::CSV_XS instance for input parsing.
input_parser
input_fields, array of str. Input CSV's field names.
input_fields
input_fields_idx, hash with field name as keys and field index (0-based integer) as values.
input_fields_idx
input_filenames, array of str.
input_filename, str. The name of the current input file being read (- if reading from stdin).
input_filenum, uint. The number of the current input file, 1 being the first file, 2 for the second, and so on.
input_fh, the handle to the current file being read.
input_rownum, uint. The number of rows that have been read (reset after each input file). In on_input_header_row phase, this will be 1 since header row (including the generated one) is the first row. Then in on_input_data_row phase (called the first time for a file), it will be 2 for the first data row, even if physically it is the first row for CSV file that does not have a header.
input_rownum
input_data_rownum, uint. The number of data rows that have been read (reset after each input file). This will be equal to input_rownum less 1 if input file has header.
input_data_rownum
input_row, aos (array of str). The current input CSV row as an arrayref.
input_row
input_row_as_hashref, hos (hash of str). The current input CSV row as a hashref, with field names as hash keys and field values as hash values. This will only be calculated if utility wants it. Utility can express so by setting $r->{wants_input_row_as_hashref} to true, e.g. in the on_begin hook.
input_row_as_hashref
$r->{wants_input_row_as_hashref}
input_header_row_count, uint. Contains the number of actual header rows that have been read. If CLI user specifies --no-input-header, this will stay at zero. Will be reset for each CSV file.
input_header_row_count
input_data_row_count, int. Contains the number of actual data rows that have read. Will be reset for each CSV file.
input_data_row_count
If you are outputting CSV (writes_csv gen argument set to true), the following keys will be available:
writes_csv
output_emitter, a Text::CSV_XS instance for output.
output_emitter
output_fields, array of str. Should be set to list of output field names. If unset, will be set to be the same as input_fields.
output_fields
output_fields_idx, hash with field names as keys and field indexes (0-based integer) as values. Normally you do not need to set this manually; you just need to set output_fields and this hash will be computed automatically for you just before the first output row is outputted.
output_fields_idx
output_filenames, array of str.
output_filenames
output_filename, str, name of current output file.
output_filename
output_filenum, uint, the number of the current output file, 1 being the first file, 2 for the second, and so on.
output_filenum
output_fh, handle to the current output file.
output_fh
output_rownum, uint. The number of rows that have been outputted (reset after each output file).
output_rownum
output_data_rownum, uint. The number of data rows that have been outputted (reset after each output file). This will be equal to input_rownum less 1 if input file has header.
output_data_rownum
For other hook-specific keys, see the documentation for associated hook point.
ACCEPTING ADDITIONAL COMMAND-LINE OPTIONS/ARGUMENTS
As mentioned above, you will get additional command-line options/arguments in $r->{util_args} hashref. Some options/arguments are already added by gen_csv_util, e.g. input_filename or input_filenames along with input_sep_char, etc (when your utility declares reads_csv), output_filename or output_filenames along with overwrite, output_sep_char, etc (when your utility declares writes_csv).
$r->{util_args}
gen_csv_util
input_sep_char
overwrite
output_sep_char
If you want to accept additional arguments/options, you specify them in add_args (hashref, with key being Each option/argument has to be specified first via add_args (as hashref, with key being argument name and value the argument specification as defined in Rinci::function)). Some argument specifications have been defined in App::CSVUtils and can be used. See existing utilities for examples.
add_args
READING CSV DATA
To read CSV data, normally your utility would provide handler for the on_input_data_row hook and sometimes additionally on_input_header_row.
OUTPUTTING STRING OR RETURNING RESULT
To output string, usually you call the provided routine $r->{code_print}. This routine will open the output files for you.
$r->{code_print}
You can also return enveloped result directly by setting $r->{result}.
$r->{result}
OUTPUTTING CSV DATA
To output CSV data, usually you call the provided routine $r->{code_print_row}. This routine accepts a row (arrayref or hashref). This routine will open the output files for you when needed, as well as print header row automatically.
$r->{code_print_row}
You can also buffer rows from input to e.g. $r->{output_rows}, then call $r->{code_print_row} repeatedly in the after_read_input hook to print all the rows.
$r->{output_rows}
READING MULTIPLE CSV FILES
To read multiple CSV files, you first specify reads_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. If you want to do something before/after each input file, you can also supply handler for before_open_input_file or after_close_input_file.
reads_multiple_csv
WRITING TO MULTIPLE CSV FILES
Similarly, to write to many CSv files, you first specify writes_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. To switch to the next file, set $r->{wants_switch_to_next_output_file} to true, in which case the next call to $r->{code_print_row} will close the current file and open the next file.
writes_multiple_csv
$r->{wants_switch_to_next_output_file}
CHANGING THE OUTPUT FIELDS
When calling $r->{code_print_row}, you can output whatever fields you want. By convention, you can set $r->{output_fields} and $r->{output_fields_idx} to let other handlers know about the output fields. For example, see the implementation of csv-concat.
$r->{output_fields}
$r->{output_fields_idx}
This function is not exported by default, but exportable.
Arguments ('*' denotes required arguments):
add_args => hash
(No description)
add_args_rels => hash
add_meta_props => hash
Add additional Rinci function metadata properties.
after_close_input_file => code
after_close_input_files => code
after_read_input => code
before_open_input_file => code
before_open_input_files => code
before_read_input => code
description => str
examples => array
links => array[hash]
name* => perl::identifier::unqualified_ascii
on_begin => code
on_end => code
on_input_data_row => code
on_input_header_row => code
reads_csv => bool (default: 1)
Whether utility reads CSV data.
reads_multiple_csv => bool
Whether utility accepts CSV data.
Setting this option to true will implicitly set the reads_csv option to true, obviously.
summary => str
writes_csv => bool (default: 1)
Whether utility writes CSV data.
writes_multiple_csv => bool
Whether utility outputs CSV data.
Setting this option to true will implicitly set the writes_csv option to true, obviously.
Return value: (bool)
$coderef = compile_eval_code($str, $label);
Compile string code $str to coderef in 'main' package, without use strict or use warnings. Die on compile error.
$str
use strict
use warnings
$res = eval_code($coderef, $r, $topic_var_value, $return_topic_var);
Use the --no-header option. Fields will be named field1, field2, and so on.
--no-header
field1
field2
Use the --tsv option.
--tsv
These utilities are not (yet) optimized, patches welcome. If your CSV is very big, perhaps a C-based solution is what you need.
Please visit the project's homepage at https://metacpan.org/release/App-CSVUtils.
Source repository is at https://github.com/perlancar/perl-App-CSVUtils.
App::TSVUtils, App::LTSVUtils, App::SerializeUtils.
xls2csv and xlsx2csv from Spreadsheet::Read
import-csv-to-sqlite from App::SQLiteUtils
Query CSV with SQL using fsql from App::fsql
csvgrep from csvgrep
csvkit, https://csvkit.readthedocs.io/en/latest/
perlancar <perlancar@cpan.org>
Adam Hopkins <violapiratejunky@gmail.com>
To contribute, you can send patches by email/via RT, or send pull requests on GitHub.
Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via:
% prove -l
If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, Pod::Weaver::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond that are considered a bug and can be reported to me.
This software is copyright (c) 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=App-CSVUtils
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
To install App::CSVUtils, copy and paste the appropriate command in to your terminal.
cpanm
cpanm App::CSVUtils
CPAN shell
perl -MCPAN -e shell install App::CSVUtils
For more information on module installation, please visit the detailed CPAN module installation guide.