Mock::Data::Charset - Generator of strings from a set of characters


  # Export a handy alias for the constructor
  use Mock::Data::Charset 'charset';
  # Use perl's regex notation for [] charsets
  my $charset = charset('A-Za-z');
          ... = charset('\p{alpha}\s\d');
          ... = charset(classes => ['digit']);
          ... = charset(ranges => ['a','z']);
          ... = charset(chars => ['a','e','i','o','u']);
  # Test membership
  charset('a-z')->contains('a') # true
  charset('a-z')->count         # 26
  charset('\w')->count          # 
  charset('\w')->count('ascii') # 
  # Iterate
  my $charset= charset('a-z');
  for (0 .. $charset->count-1) {
    my $ch= $charset->get_member($_);
  }
  # this one can be very expensive if the set is large:
  for ($charset->members->@*) { ... }
  # Generate random strings
  my $str= $charset->generate($mockdata, 10); # 10 random chars from this charset
      ...= $charset->generate($mockdata, { min_codepoint => 1, max_codepoint => 127 }, 10);
      ...= $charset->generate($mockdata, { size => [5,10] }); # between 5 and 10 chars
      ...= $charset->generate($mockdata, { size => sub { 5 + int rand 6 } }); # same


This generator is optimized for holding sets of Unicode characters. It behaves just like the Mock::Data::Set generator but it also lets you inspect the member codepoints, iterate the codepoints, and constrain the range of codepoints when generating strings.



  $charset= Mock::Data::Charset->new( %options );
  $charset= charset( %options );
  $charset= charset( $notation );

If you supply a single non-hashref argument to the constructor, it is assumed to be the "notation" string. Otherwise, it is treated as key/value pairs. You may specify the members of the charset by one of the attributes notation, members, or member_invlist, or construct it from the following charset-building options:


The "chars" option is an arrayref of literal character values to include in the set.


The "codepoints" option is an arrayref of Unicode codepoint numbers.

  ranges => [ ['a','z'], ['0', '9'] ],
  ranges => [ 'a', 'z', '0', '9' ],

An arrayref holding start/end pairs of characters, optionally with inner arrayrefs for each start/end pair.
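
The two spellings of "ranges" shown above describe the same data. As a rough illustration only, the following sketch normalizes either form into a list of [start, end] pairs. The helper name normalize_ranges is hypothetical and is not part of Mock::Data::Charset; the module's own handling may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mock::Data::Charset): accept either the
# flat or the nested form of "ranges" and return a list of [start, end] pairs.
sub normalize_ranges {
    my @spec = @{ $_[0] };
    my @pairs;
    while (@spec) {
        my $item = shift @spec;
        if (ref $item eq 'ARRAY') {
            push @pairs, [ @$item ];            # already a start/end pair
        } else {
            die "Range start '$item' has no matching end" unless @spec;
            push @pairs, [ $item, shift @spec ];
        }
    }
    return @pairs;
}

my @nested = normalize_ranges([ ['a','z'], ['0','9'] ]);
my @flat   = normalize_ranges([ 'a', 'z', '0', '9' ]);
print join(' ', map { "$_->[0]-$_->[1]" } @flat), "\n";   # a-z 0-9
```

Both calls produce the same two pairs, which is why the flat form can be used as shorthand.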


The "codepoint_ranges" option is the same as "ranges" but with codepoint numbers instead of characters.


The "classes" option is an arrayref of character class names recognized by perl (such as Posix or Unicode classes).


Negate the membership of the charset as described by chars/ranges/classes. This applies to the charset-building options, but has no effect on attributes.

The constructor may also be given any of the keys for "generate_opts", which will be moved into that attribute.

For convenience, you may export "charset" in Mock::Data::Util, which calls this constructor.

If you call new on an object, it carries over the following settings to the new object: max_codepoint, generator_opts, member_invlist (unless chars change).



The "notation" attribute holds a Perl regex charset notation: the text that occurs between '[...]' in a regex. (Note that if you use backslash notations, like notation => '\w', you should either use a single-quoted string or escape them as "\\w".)

This returns the same string that was passed to the constructor, if you gave the constructor a regex-notation string instead of more specific attributes. If you did not, a generic-looking notation will be built on demand. Read-only.
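
The quoting advice for backslash notations comes from ordinary Perl string rules rather than anything specific to this module. The short sketch below uses only core Perl to show that the single-quoted form and the doubled-backslash form yield the same four-character string.

```perl
use strict;
use warnings;

my $single  = '\w\d';     # single quotes keep the backslashes
my $escaped = "\\w\\d";   # double quotes need each backslash doubled

print length($single), "\n";                         # 4 characters: \ w \ d
print $single eq $escaped ? "same\n" : "differ\n";   # same
```

Writing "\w\d" in double quotes without doubling the backslashes would drop them, so the charset would receive the plain letters "wd".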


The "min_codepoint" attribute is the minimum codepoint to be returned from the generator. Read/write. This is useful if you want to eliminate control characters (or maybe just NULs) in your output.


The "max_codepoint" attribute is the maximum Unicode codepoint to be considered. Read-only. If you are only interested in a subset of the Unicode character space, such as ASCII, you can set this to a value like 0x7F and speed up the calculations on the character set.


The "str_len" attribute determines the length of the string that will be returned from generate if no length is specified to that function. This may be a plain integer, an arrayref of [$min,$max], or a coderef that returns an integer: sub { 5 + int rand 10 }.
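
To make the three accepted forms concrete, here is a minimal sketch of how a length specification could be reduced to an integer. The helper resolve_len is hypothetical, and the sketch assumes that [$min,$max] is inclusive at both ends; it is not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Mock::Data::Charset: turn a length spec
# into a concrete integer. Assumes [$min,$max] is inclusive on both ends.
sub resolve_len {
    my ($spec) = @_;
    return $spec->() if ref $spec eq 'CODE';      # coderef: call it
    if (ref $spec eq 'ARRAY') {                   # arrayref: pick within range
        my ($min, $max) = @$spec;
        return $min + int rand($max - $min + 1);
    }
    return $spec;                                 # plain integer
}

print resolve_len(8), "\n";                          # always 8
print resolve_len([5, 10]), "\n";                    # an integer from 5 to 10
print resolve_len(sub { 5 + int rand 10 }), "\n";    # an integer from 5 to 14
```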


The "count" attribute is the number of members in the set. Read-only.


The "members" attribute returns an arrayref of every character in the set. Try not to use this attribute, as building it can be very expensive for common sets like [:alpha:] (100K members, tens of MB of RAM). Use "member_invlist" or "get_member" instead, when possible, or set "max_codepoint" to restrict the set to characters you care about.



The "member_invlist" attribute returns an arrayref holding the "inversion list" describing the members of this set. An inversion list stores the first codepoint belonging to the set, followed by the next higher codepoint which does not belong to the set, followed by the next that does, etc. This data structure allows for efficient negation/inversion of the list.

You may write a new value to this attribute, but you must not modify the existing array in place.
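
As an illustration of the inversion list layout described above, the following core-Perl sketch builds the list for [0-9A-Z] by hand and derives membership and member count from it. The helper names are hypothetical, and the sketch assumes an even-length list (every range has an explicit end).

```perl
use strict;
use warnings;

# Inversion list for [0-9A-Z]: (first member, first non-member) pairs
my @invlist = ( ord('0'), ord('9') + 1, ord('A'), ord('Z') + 1 );

# Hypothetical helper: a codepoint is a member when an odd number of
# boundaries are less than or equal to it.
sub invlist_contains {
    my ($invlist, $cp) = @_;
    my $n = grep { $_ <= $cp } @$invlist;
    return $n % 2 == 1;
}

# Hypothetical helper: sum the width of each start/end pair.
sub invlist_count {
    my ($invlist) = @_;
    my $total = 0;
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        $total += $invlist->[$i + 1] - $invlist->[$i];
    }
    return $total;
}

print invlist_count(\@invlist), "\n";                               # 36
print invlist_contains(\@invlist, ord 'Q') ? "yes\n" : "no\n";      # yes
print invlist_contains(\@invlist, ord 'a') ? "yes\n" : "no\n";      # no
```

Four integers are enough to describe all 36 members, which is why this form is much cheaper than a full member array for large sets.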



  $charset->generate($mockdata, $len);
  $charset->generate($mockdata, \%options, $len);
  $charset->generate($mockdata, \%options);

Generate a string of characters from this charset. The %options may override the following attributes: "min_codepoint", "max_codepoint" (but only smaller values), and "str_len". The default length is 1 character.


Return a plain coderef that invokes "generate" on this object.


  my $parse_info= Mock::Data::Charset->parse('\dA-Z_');
  # {
  #   codepoints        => [ ord '_' ],
  #   codepoint_ranges  => [ ord "A", ord "Z" ],
  #   classes           => [ 'digit' ],
  # }

This is a class method that accepts a Perl-regex-notation string for a charset and returns a hashref of the arguments that should be passed to the constructor.

This dies if it encounters a syntax error or any Perl feature that is not implemented.


  my $char= $charset->get_member($offset);

Return the Nth character of the set, starting from 0. Returns undef for values greater than or equal to "count". You can use negative offsets to index from the end of the list, like in substr.
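
The indexing rules can be sketched against a hand-built inversion list without materializing the member array. The helper nth_member below is hypothetical and only mimics the documented behavior; returning undef for a negative offset that reaches past the start of the set is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: find the Nth member of a set
# described by an even-length inversion list.
sub nth_member {
    my ($invlist, $offset) = @_;
    my $count = 0;
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        $count += $invlist->[$i + 1] - $invlist->[$i];
    }
    $offset += $count if $offset < 0;               # negative: count from end
    return undef if $offset < 0 || $offset >= $count;
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        my $size = $invlist->[$i + 1] - $invlist->[$i];
        return chr($invlist->[$i] + $offset) if $offset < $size;
        $offset -= $size;                           # skip this whole range
    }
    return undef;
}

my @invlist = ( ord('0'), ord('9') + 1, ord('A'), ord('Z') + 1 );   # [0-9A-Z]
print nth_member(\@invlist, 0), "\n";     # 0
print nth_member(\@invlist, 10), "\n";    # A
print nth_member(\@invlist, -1), "\n";    # Z
print defined(nth_member(\@invlist, 36)) ? "defined\n" : "undef\n";   # undef
```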


Same as "get_member" but returns a codepoint integer instead of a character.


  my ($offset, $ins_pos)= $charset->find_member($char);

Return the index of a character within the members list. If the character is not a member, this returns undef, but if you call it in array context the second element gives the position where it would be found if it were a member.
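
The two return values can be illustrated with a simple linear scan over an inversion list. The helper find_in_invlist is hypothetical, and returning the index in both positions when the character is a member is an assumption of this sketch, since the text only specifies the second value for non-members.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: locate a character in a set
# described by an even-length inversion list.
sub find_in_invlist {
    my ($invlist, $char) = @_;
    my $cp    = ord $char;
    my $below = 0;                       # members in ranges already passed
    for (my $i = 0; $i + 1 < @$invlist; $i += 2) {
        my ($start, $end) = @{$invlist}[ $i, $i + 1 ];
        last if $cp < $start;            # falls in the gap before this range
        if ($cp < $end) {
            my $idx = $below + $cp - $start;
            return wantarray ? ($idx, $idx) : $idx;
        }
        $below += $end - $start;
    }
    return wantarray ? (undef, $below) : undef;
}

my @invlist = ( ord('0'), ord('9') + 1, ord('A'), ord('Z') + 1 );   # [0-9A-Z]
my ($idx,  $pos) = find_in_invlist(\@invlist, 'B');   # (11, 11)
my ($none, $ins) = find_in_invlist(\@invlist, ':');   # (undef, 10)
print "$idx $pos\n";
print defined $none ? "found" : "not found", ", would insert at $ins\n";
```

The wantarray check matters here: returning a two-element list in scalar context would yield the insertion position instead of undef.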


  my $charset2= $charset->negate;

Return a new charset which contains exactly the characters that this one does not, up to the "max_codepoint" if defined.
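
Negation is cheap on an inversion list because it only touches the ends of the list. The sketch below shows the idea with a hypothetical helper negate_invlist; it assumes every boundary in the input is at most max_codepoint + 1 and is not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: negate an inversion list,
# limiting the result to codepoints 0 .. $max_codepoint.
sub negate_invlist {
    my ($invlist, $max_codepoint) = @_;
    my @neg = @$invlist;
    # Toggle membership of codepoint 0, which flips every later range too
    if (@neg && $neg[0] == 0) { shift @neg } else { unshift @neg, 0 }
    # An odd-length list ends with an open range; close or drop it
    if (@neg % 2) {
        if ($neg[-1] > $max_codepoint) { pop @neg }
        else                           { push @neg, $max_codepoint + 1 }
    }
    return \@neg;
}

my @invlist = ( ord('0'), ord('9') + 1, ord('A'), ord('Z') + 1 );   # [0-9A-Z]
my $neg = negate_invlist(\@invlist, 127);
print "@$neg\n";    # 0 48 58 65 91 128
```

The negated ASCII set has 128 - 36 = 92 members, yet only two integers were added to the list.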


  my $charset3= $charset1->union($charset2, ...);

Merge one or more charsets. The result contains every character that appears in any of the sets, but clamped to the max_codepoint of the current set.

The arguments may also be plain inversion list arrayrefs instead of charset objects.
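
A union with clamping can be sketched directly on inversion list arrayrefs. The helper union_invlists is hypothetical and uses a plain sort-and-merge of ranges, which may not be how the module does it; it assumes even-length input lists.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: merge inversion lists and
# clamp the result to codepoints 0 .. $max_codepoint.
sub union_invlists {
    my ($max_codepoint, @invlists) = @_;
    my @ranges;
    for my $inv (@invlists) {
        for (my $i = 0; $i + 1 < @$inv; $i += 2) {
            push @ranges, [ $inv->[$i], $inv->[$i + 1] ];
        }
    }
    my @merged;     # flat inversion list; $merged[-1] is the last range end
    for my $r (sort { $a->[0] <=> $b->[0] } @ranges) {
        my ($start, $end) = @$r;
        $end = $max_codepoint + 1 if $end > $max_codepoint + 1;
        next if $start >= $end;              # empty, or entirely above clamp
        if (@merged && $start <= $merged[-1]) {
            $merged[-1] = $end if $end > $merged[-1];   # overlap or adjacent
        } else {
            push @merged, $start, $end;
        }
    }
    return \@merged;
}

my $a_to_m = [ ord('a'), ord('m') + 1 ];
my $h_to_z = [ ord('h'), ord('z') + 1 ];
my $digits = [ ord('0'), ord('9') + 1 ];
my $all    = union_invlists(127, $a_to_m, $h_to_z, $digits);
print "@$all\n";    # 48 58 97 123
```

Lowering the first argument to ord('o') would cut the merged letter range off after 'o', which mirrors the clamping described above.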


Michael Conrad


version 0.03


This software is copyright (c) 2021 by Michael Conrad.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.