package App::lcpan::Manual::Internals;

# AUTHORITY
# DATE
our $DIST = 'App-lcpan-Manual'; # DIST
# VERSION

1;
# ABSTRACT: App::lcpan internals

__END__

=pod

=encoding UTF-8

=head1 NAME

App::lcpan::Manual::Internals - App::lcpan internals

=head1 VERSION

version 1.058.000

=head1 INDEXING

Indexing is done in several steps. The last step (parsing release files) is done
in at least 3 passes. We can skip one or more of these passes to save time, if
we don't need the information that the passes gather.

=head2 First step: parse authors/01mailrc.txt.gz

First, we parse F<authors/01mailrc.txt.gz> and insert the data into C<author>
table. Some DarkPANs like those produced by L<OrePAN> have
F<authors/00whois.xml> instead.

=head2 Second step: parse modules/02packages.details.txt.gz

Then we parse F<modules/02packages.details.txt.gz>, which is the main meat of
CPAN index. This file links package (module) names to release tarballs. A
snippet from the file:

 ...
 Log::ger                          0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
 Log::ger::App                     0.014  P/PE/PERLANCAR/Log-ger-App-0.014.tar.gz
 Log::ger::DBI::Query              0.001  P/PE/PERLANCAR/Log-ger-DBI-Query-0.001.tar.gz
 Log::ger::Filter                  0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
 Log::ger::Filter::Code            0.037  P/PE/PERLANCAR/Log-ger-0.037.tar.gz
 ...

We insert these records to C<file> table, so each release file gets a numeric
file ID, and C<module> table, so each module gets a numeric module ID as well as
link to its file ID.

At this point, we haven't parsed distribution names yet because that will need
information from META.{json,yaml} inside the release files.

=head2 Third step: (release) files

Then we start to examine the release files. This is done in several passes and
you have the option to skip some of the passes. The third step is done in
multiple passes because in pass 2, we want to collect all known scripts first to
be able to detect links to scripts in POD (collected in pass 1). Also some
passes are more high-level and/or experimental and/or optional.

=head3 Third step pass 1: content, scripts, distribution metadata, dependency

First we list the content of each release archive and store the results into the
C<content> table. This will allow us to check whether a distribution has a
distribution metadata file (F<META.yml> or F<META.json>), whether a distribution
contains scripts, and so on.

We populate the C<script> table by heuristically including content which from
its name looks like script, e.g.:

 script/foo
 bin/whatever

We then extract the distribution metadata files (either F<META.json> or
F<META.yaml>) and store the information contained in these metadata files into
the database. These include the distribution name (written to the C<dist> table)
and the dependency information (written to the C<dep> table).

At the end of this first pass, we have a pretty useful database already. One of
the main uses of lcpan is to provide dependency information. You can skip the
other passes if you want.

=head3 Third step pass 2: parse POD

In the second pass, we extract modules and script files inside each release file
into a temporary directory, then parse their POD. This pass usually takes
several times the amount of time it takes to complete the first pass. At the
time of this writing (2020-04-19) on my computer, the first pass takes about 14
minutes and the second pass takes 72 minutes. A big release file that contains
thousands of (mostly autogenerated) module files (yes, they exist; see L<Paws>
for example) can take 25 minutes on its own. You might want to skip those files
if you do not expect to ever need to deal with the module/distribution; see the
C<lcpan update> documentation. For example, in F<lcpan.conf> you can put:

 skip_index_file_patterns = ^Paws-\d
 skip_index_file_patterns = ^Google-Ads-GoogleAds-Client-\d
 skip_index_file_patterns = ^Google-Ads-AdWords-Client-\d
 skip_index_file_patterns = ^eBay-API-\d
 skip_index_file_patterns = ^Microsoft-AdCenter-\d
 skip_index_file_patterns = ^VMOMI-\d

By parsing POD, we get: module/script abstract (stored into C<module> table) and
mentions (i.e. a POD that links to another POD, stored in C<mention> table). The
mentions information is mainly useful to know how related a module is to another
(see C<lcpan related-mods> subcommand).

=head3 Third step pass 3: subroutine

In this pass, we try to extract subroutine names in modules. This requires the
use of a source code lexer (lcpan uses L<Compiler::Lexer>). On my computer, this
pass takes another 19 minutes. At the time of this writing, this pass is
experimental and not enabled by default.

=head1 AUTHOR

perlancar <perlancar@cpan.org>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2020 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut