=encoding utf8

=head1 TITLE

DRAFT: Synopsis 32: Setting Library - Str

=head1 AUTHORS

    Rod Adams <rod@rodadams.net>
    Larry Wall <larry@wall.org>
    Aaron Sherman <ajs@ajs.com>
    Mark Stosberg <mark@summersault.com>
    Carl Mäsak <cmasak@gmail.com>
    Moritz Lenz <moritz@faui2k3.org>
    Tim Nelson <wayland@wayland.id.au>

=head1 VERSION

    Created: 19 Mar 2009 extracted from S29-functions.pod

    Last Modified: 17 Apr 2009
    Version: 3

The document is a draft.

If you read the HTML version, it is generated from the Pod in the specs
repository under
L<https://github.com/perl6/specs/blob/master/S32-setting-library/Str.pod>
so edit it there in the git repository if you would like to make changes.

=head1 Str

General notes about strings:

A Str can exist at several Unicode levels at once. Which level you
interact with typically depends on what your current lexical context has
declared the "working Unicode level to be". Default is C<Grapheme>.
[Default can't be C<CharLingua> because we don't go into "language"
mode unless there's a specific language declaration saying either
exactly what language we're going into or, in the absence of that, how to
find the exact language somewhere in the environment.]

Attempting to use a string at a level higher it can support is handled
without warning. The current highest supported level of the string
is simply mapped Char for Char to the new higher level. However,
attempting to stuff something of a higher level a lower-level string
is an error (for example, attempting to store Kanji in a Byte string).
An explicit conversion function must be used to tell it how you want it
encoded.

Attempting to use a string at a level lower than what it supports is not
allowed.

If a function takes a C<Str> and returns a C<Str>, the returned C<Str>
will support the same levels as the input, unless specified otherwise.

The following are all provided by the C<Str> role:

=over

=item p5chop

 our Char multi method p5chop ( Str  $string is rw: ) is export(:P5)
 my Char multi p5chop ( Str *@strings is rw ) is export(:P5)

Trims the last character from C<$string>, and returns it. Called with a
list, it chops each item in turn, and returns the last character
chopped.

=item chop

 our Str multi method chop ( Str  $string: ) is export

Returns string with one Char removed from the end.

=item p5chomp

 our Int multi method p5chomp ( Str  $string is rw: ) is export(:P5)
 my Int multi p5chomp ( Str *@strings is rw ) is export(:P5)

Related to C<p5chop>, only removes trailing chars that match C</\n/>. In
either case, it returns the number of chars removed.

=item chomp

 our Str multi method chomp ( Str $string: ) is export

Returns string with one newline removed from the end.  An arbitrary
terminator can be removed if the input filehandle has marked the
string for where the "newline" begins.  (Presumably this is stored
as a property of the string.)  Otherwise a standard newline is removed.

Note: Most users should just let their I/O handles autochomp instead.
(Autochomping is the default.)

=item lc

 our Str multi method lc ( Str $string: ) is export

Returns the input string after converting each character to its lowercase
form, if uppercase.


=item lcfirst

 our Str multi method lcfirst ( Str $string: ) is export

Like C<lc>, but only affects the first character.


=item uc

 our Str multi method uc ( Str $string: ) is export

Returns the input string after converting each character to its uppercase
form, if lowercase. This is not a Unicode "titlecase" operation, but a
full "uppercase".


=item ucfirst

 our Str multi method ucfirst ( Str $string: ) is export

Performs a Unicode "titlecase" operation on the first character of the string.

=item normalize

 our Str multi method normalize ( Str $string: Bool :$canonical = Bool::True, Bool :$recompose = Bool::False ) is export

Performs a Unicode "normalization" operation on the string. This involves
decomposing the string into its most basic combining elements, and potentially
re-composing it. Full detail on the process of decomposing and
re-composing strings in a normalized form is covered in the Unicode
specification Sections 3.7, Decomposition and 3.11,
Canonical Ordering Behavior of the Unicode Standard, 4.0.
Additional named parameters are reserved for future Unicode expansion.

For everyday use there are aliases that map to the
I<Unicode Standard Annex #15: Unicode Normalization Forms> document's
names for the various modes of normalization:

 our Str multi method nfd ( Str $string: ) is export {
   $string.normalize(:canonical, :!recompose);
 }
 our Str multi method nfc ( Str $string: ) is export {
   $string.normalize(:canonical, :recompose);
 }
 our Str multi method nfkd ( Str $string: ) is export {
   $string.normalize(:!canonical, :!recompose);
 }
 our Str multi method nfkc ( Str $string: ) is export {
   $string.normalize(:!canonical, :recompose);
 }

Decomposing a string can be used to compare
Unicode strings in a binary form, providing that they use the same
encoding. Without decomposing first, two
Unicode strings may contain the same text, but not the same byte-for-byte
data, even in the same encoding.
The decomposition of a string is performed according to tables
in the Unicode standard, and should be compatible with decompositions
performed by any system.

The C<:canonical> flag controls the use of "compatibility decompositions".
For example, in canonical mode, "fi" is left unaffected because it is
not a composition. However, in compatibility mode, it will be replaced
with "fi". Decomposed sequences will be ordered in a canonical way
in either mode.

The C<:recompose> flag controls the re-composition of decomposed forms.
That is, a combining sequence will be re-composed into the canonical
composite where possible.

These de-compositions and re-compositions are performed recursively,
until there is no further work to be done.

Note that this function is really only applicable when dealing with codepoint
strings.  Grapheme strings are normally processed at a higher abstraction level
that is independent of normalization, and are lazily normalized into the
desired normalization when transferred to lexical scopes or handles that care.

=item samecase

 our Str multi method samecase ( Str $string: Str $pattern ) is export

Has the effect of making the case of the string match the case pattern in C<$pattern>.
(Used by s:ii/// internally, see L<S05>.)

=item samemark

 our Str multi method samemark ( Str $string: Str $pattern ) is export

Has the effect of making the case of the string match the marking pattern in C<$pattern>.
(Used by s:mm/// internally, see L<S05>.)

=item capitalize

 our Str multi method capitalize ( Str $string: ) is export

Has the effect of first doing an C<lc> on the entire string, then performing a
C<s:g/(\w+)/{ucfirst $0}/> on it.


=item length

This word is banned in Perl 6.  You must specify units.

=item chars

 our Int multi method chars ( Str $string: ) is export

Returns the number of characters in the string in the current
(lexically scoped) idea of what a normal character is, usually graphemes.

=item graphs

 our Int multi method codes ( Str $string: ) is export

Returns the number of graphemes in the string in a language-independent way.

=item codes

 our Int multi method codes ( Str $string: $nf = $?NF) is export

Returns the number of codepoints in the string if it were canonicalized the
specified way.  Do not confuse codepoints with UTF-16 encoding.  Characters
above U+FFFF count as a single codepoint.

=item bytes

 our Int multi method bytes ( Str $string: $enc = $?ENC, :$nf = $?NF) is export

Returns the number of bytes in the string if it were encoded in the
specified way.  Note the inequality:

    .bytes("UTF-16","C") >= .codes("C") * 2

This is caused by the possibility of surrogate pairs, which are counted as one
codepoint.  However, this problem does not arise for UTF-32:

    .bytes("UTF-32","C") == .codes("C") * 4

=item encode

    our Buf multi method encode($encoding = $?ENC, $nf = $?NF)

Returns a C<Buf> which represents the original string in the given encoding
and normal form. The actual return type is as specific as possible, so
C<$str.encode('UTF-8')> returns an C<utf8> object,
C<$str.encode('ISO-8859-1')> a C<buf8>.

=item index

 our StrPos multi method index( Str $string: Str $substring, StrPos $pos = StrPos(0) ) is export
 our StrPos multi method index( Str $string: Str $substring, Int $pos ) is export

C<index> searches for the first occurrence of C<$substring> in C<$string>,
starting at C<$pos>. If $pos is an C<Int>, it is taken to be in the units
of the calling scope, which defaults to "graphemes".

The value returned is always a C<StrPos> object.  If the substring
is found, then the C<StrPos> represents the position of the first
character of the substring. If the substring is not found, a bare
C<StrPos> containing no position is returned.  This prototype C<StrPos>
evaluates to false because it's really a kind of undefined value.  Do not evaluate
as a number, because instead of returning -1 it will return 0 and issue
a warning.


=item pack

 our buf8 multi pack( *@items where { all(@items) ~~ Pair } )
 our buf8 multi pack( Str $template, *@items )

C<pack> takes a list of pairs and formats the values according to
the specification of the keys. Alternately, it takes a string
C<$template> and formats the rest of its arguments according to
the specifications in the template string. The result is a sequence
of bytes.

Templates are strings of the form:

  grammar Str::PackTemplate {
    regex TOP       { ^ <template> $ }
    regex template  { [ <group> | <specifier> <count>? ]* }
    token group     { \( <template> \) }
    token specifier { <[aAZbBhHcCsSiIlLnNvVqQjJfdFDpPuUwxX\@]> \!? }
    token count     { \*
                    | \[ [ \d+ | <specifier> ] \]
                    | \d+ }
  }

In the pairwise mode, each key must contain a single C<< <group> >> or
C<< <specifier> >>, and the values must be either scalar arguments or
arrays.

[ Note: Need more documentation and need to figure out what Perl 5 things
        no longer make sense. Does Perl 6 need any extra formatting
        features? -ajs ]

[I think pack formats should be human readable but compiled to an
internal form for efficiency.  I also think that compact classes
should be able to express their serialization in pack form if
asked for it with .packformat or some such.  -law]

=item quotemeta

 our Str multi method quotemeta ( Str $string: ) is export

Returns the input string with all non-"word" characters back-slashed.
That is, all characters not matching "/<[A..Za..z_0..9]>/" will be preceded
by a backslash in the returned string, regardless of any locale settings.

[Note from Pm:  Should that be "/\w/" instead?  Or, if the intent
is to duplicate p5 functionality, perhaps it should be "p5quotemeta"?
Do we even want this method at all?]

=item rindex
X<rindex>

 our StrPos multi method rindex( Str $string: Str $substring, StrPos $pos? ) is export
 our StrPos multi method rindex( Str $string: Str $substring, Int $pos ) is export

Returns the position of the last C<$substring> in C<$string>. If C<$pos>
is specified, then the search starts at that location in C<$string>, and
works backwards. See C<index> for more detail.

=item split
X<split>

 our List multi split ( Str $delimiter, Str $input, Int $limit = Inf )
 our List multi split ( Regex $delimiter, Str $input, Int $limit = Inf )
 our List multi method split ( Str $input: Str $delimiter, Int $limit = Inf )
 our List multi method split ( Str $input: Regex $delimiter, Int $limit = Inf, Bool :$all = False)

Splits a string up into pieces based on delimiters found in the string.

String delimiters must not be treated as rules but as constants.  The
default is no longer S<' '> since that would be interpreted as a constant.
P5's C<< split('S< >') >> will translate to C<comb>.  Null trailing fields
are no longer trimmed by default.

The C<split> function no longer has a default delimiter nor a default invocant.
In general you should use C<words> to split on whitespace now, or C<comb> to break
into individual characters.  See below.

If the C<:all> adverb is supplied to the C<Regex> form, then the
delimiters are returned as C<Match> objects in alternation with the
split values.  Unlike with Perl 5, if the delimiter contains multiple
captures they are returned as submatches of single C<Match> object.
(And since C<Match> does C<Capture>, whether these C<Match> objects
eventually flatten or not depends on whether the expression is bound
into a list or slice context.)

You may also split lists and filehandles.  C<$*ARGS.split(/\n[\h*\n]+/)>
splits on paragraphs, for instance.  Lists and filehandles are automatically
fed through C<cat> in order to pretend to be string.  The resulting
C<Cat> is lazy.  Accessing a filehandle as both a filehandle and as
a C<Cat> is undefined.

=item comb

 our List multi comb ( Regex $matcher, Str $input, Int $limit = Inf )
 our List multi method comb ( Str $input: Regex $matcher = /./, Int $limit = Inf )

The C<comb> function looks through a string for the interesting bits,
ignoring the parts that don't match.  In other words, it's a version
of split where you specify what you want, not what you don't want.

That means the same restrictions apply to the matcher rule as do to
split's delimiter rule.

By default it pulls out all individual characters.  Saying

    $string.comb(/pat/, $n)

is equivalent to

    map {.Str}, $string.match(rx:global:x(0..$n):c/pat/)

You may also comb lists and filehandles.  C<+$*IN.comb> counts the words on
standard input, for instance.  C<comb(/./, $thing)> returns a list of single
C<Char> strings from anything that can give you a C<Str>.  Lists and
filehandles are automatically fed through C<cat> in order to pretend to
be string.  This C<Cat> is also lazy.

If the C<:match> adverb is applied, a list of C<Match> objects (one
per match) is returned instead of strings.  This can be used to
access capturing subrules in the matcher.  The unmatched portions
are never returned -- if you want that, use C<split :all>.  If the
function is combing a lazy structure, the return values may also be
lazy.  (Strings are not lazy, however.)

=item lines

 our List multi method lines ( Str $input: Int $limit = Inf ) is export

Returns a list of lines, i.e. the same as a call to
C<$input.comb( / ^^ \N* /, $limit )> would.

=item words

 our List multi method words ( Str $input: Int $limit = Inf ) is export

Returns a list of non-whitespace bits, i.e. the same as a call to
C<$input.comb( / \S+ /, $limit )> would.

=item flip

The C<flip> function reverses a string character by character.

     our Str multi method flip ( $str: ) is export {
        $str.comb.reverse.join;
     }

This function will misplace accents if used at a Unicode
level less than graphemes.

=item sprintf

 our Str multi method sprintf ( Str $format: *@args ) is export

This function is mostly identical to the C library sprintf function.

The C<$format> is scanned for C<%> characters. Any C<%> introduces a
format token. Format tokens have the following grammar:

 grammar Str::SprintfFormat {
  regex format_token { '%': <index>? <precision>? <modifier>? <directive> }
  token index { \d+ '$' }
  token precision { <flags>? <vector>? <precision_count> }
  token flags { <[ \x20 + 0 \# \- ]>+ }
  token precision_count { [ <[1..9]>\d* | '*' ]? [ '.' [ \d* | '*' ] ]? }
  token vector { '*'? v }
  token modifier { < ll l h m V q L > }
  token directive { < % c s d u o x e f g X E G b p n i D U O F > }
 }

Directives guide the use (if any) of the arguments. When a directive
(other than C<%>) is used, it indicates how the next argument
passed is to be formatted into the string.

The directives are:

 %   a literal percent sign
 c   a character with the given codepoint
 s   a string
 d   a signed integer, in decimal
 u   an unsigned integer, in decimal
 o   an unsigned integer, in octal
 x   an unsigned integer, in hexadecimal
 e   a floating-point number, in scientific notation
 f   a floating-point number, in fixed decimal notation
 g   a floating-point number, in %e or %f notation
 X   like x, but using upper-case letters
 E   like e, but using an upper-case "E"
 G   like g, but with an upper-case "E" (if applicable)
 b   an unsigned integer, in binary
 C   special: invokes the arg as code, see below

Compatibility:

 i   a synonym for %d
 D   a synonym for %ld
 U   a synonym for %lu
 O   a synonym for %lo
 F   a synonym for %f

Perl 5 (non-)compatibility:

 n   produces a runtime exception (see below)
 p   produces a runtime exception

The special format directive, C<%C> invokes the target argument as
code, passing it the result string that has been generated thus
far and the argument array.

Here's an example of its use:

 sprintf "%d%C is %d digits long",
    $num,
    sub ($s, @args is rw) { @args[2] = $s.elems },
    0;

The special directive, C<%n> does not work in Perl 6 because of the
difference in parameter passing conventions, but the example above
simulates its effect using C<%C>.

Modifiers change the meaning of format directives. The most important being
support for complex numbers (a basic type in Perl). Here are all of the
modifiers and what they modify:

 h interpret integer as native "short" (typically int16)
 l interpret integer as native "long" (typically int32 or int64)
 ll interpret integer as native "long long" (typically int64)
 L interpret integer as native "long long" (typically uint64)
 q interpret integer as native "quads" (typically int64 or larger)
 m interpret value as a complex number

The C<m> modifier works with C<d,u,o,x,F,E,G,X,E> and C<G> format
directives, and the directive applies to both the real and imaginary
parts of the complex number.

Examples:

 sprintf "%ld a big number, %lld a bigger number, %mf complexity\n",
    4294967295, 4294967296, 1+2i);

=item fmt

  our Str multi method fmt( Scalar $scalar: Str $format = '%s' )
  our Str multi method fmt( List $list: Str $format = '%s',
                                        Str $separator = ' ' )
  our Str multi method fmt( Hash $hash: Str $format = "%s\t%s",
                                        Str $separator = "\n" )
  our Str multi method fmt( Pair $pair: Str $format = "%s\t%s" )

A set of wrappers around C<sprintf>. A call to the scalar version
C<$o.fmt($format)> returns the result of C<sprintf($format, $o)>. A call to
the list version C<@a.fmt($format, $sep)> returns the result of
C<join $sep, map { sprintf($format, $_) }, @a>. A call to the hash version
C<%h.fmt($format, $sep)> returns the result of
C<join $sep, map { sprintf($format, $_.key, $_.value) }, %h.pairs>. A call
to the pair version C<$p.fmt($format)> returns the result of
C<sprintf($format, $p.key, $p.value)>.

=item substr

 our Str multi method substr (Str $string: StrPos $start, StrLen $length?) is rw is export
 our Str multi method substr (Str $string: StrPos $start, StrPos $end) is rw is export
 our Str multi method substr (Str $string: StrPos $start, Int $length) is rw is export
 our Str multi method substr (Str $string: Int $start, StrLen $length?) is rw is export
 our Str multi method substr (Str $string: Int $start, StrPos $end) is rw is export
 our Str multi method substr (Str $string: Int $start, Int $length) is rw is export

C<substr> returns part of an existing string. You control what part by
passing a starting position and optionally either an end position or length.
If you pass a number as either the position or length, then it will be used
as the start or length with the assumption that you mean "chars" in the
current Unicode abstraction level, which defaults to graphemes.  A number
in the 3rd argument is interpreted as a length rather than a position (just
as in Perl 5).

Here is an example of its use:

 $initials = substr($first_name,0,1) ~ substr($last_name,0,1);

Optionally, you can use substr on the left hand side of an assignment
like so:

 $string ~~ /(barney)/;
 substr($string, $0.from, $0.to) = "fred";

If the replacement string is longer or shorter than the matched sub-string,
then the original string will be dynamically resized.

=item trim

  multi method trim() is export;

Returns a copy of the string, with leading and trailing whitespace removed.

=item unpack

=item match

 method Match match(Str $self: Regex $search);

See L<S05/Substitution>

=item subst

 method Str subst(Str $self: Regex $search, Str $replacement);

=item trans

 method trans(Str $self: Str $key, Str $val);

 our multi trans(List of Pair %data);

=item indent

     our Str multi method indent ($str: $steps) is export;

Returns a re-indented string wherein C<$steps> number of spaces have been added
to each line (i.e. to the beginning of the string and after each logical
newline).

If C<$steps> is negative, removes that many spaces instead. Should any line
contain too few leading spaces, only those are removed and a warning is issued.

If C<$steps> is C<*>, removes exactly as many spaces as are needed to make at
least one line have zero indentation.

When removing indentation, the method will assume hard tabs to be
C<< ($?TABSTOP // 8) >> spaces, and will treat other horizontal whitespace
characters as synonymous to spaces. Characters not participating in the
re-indenting will be left untouched, and those added in an indent call will
be either (1) consistent with subsequent leading whitespace already on the
line, if these are all the same, or (2) spaces. During an unindent, the
trailing tab character in a chain of leading tab characters may explode
into a number of space characters.

=back

=head1 Additions

Please post errors and feedback to perl6-language.  If you are making
a general laundry list, please separate messages by topic.