NAME

Email::Extractor::Utils - Set of functions that can be useful when building web crawlers

VERSION

version 0.03

SYNOPSIS

  use Email::Extractor::Utils qw( looks_like_url looks_like_file get_file_uri load_addr_to_str );
  # or use Email::Extractor::Utils qw[:ALL];
  $Email::Extractor::Utils::Verbose = 1;
  
  my $text = load_addr_to_str($url);

$Email::Extractor::Utils::Assets

List of asset extensions, used in "drop_asset_links" in Email::Extractor::Utils

To see the default list of assets:

    perl -Ilib -E "use Email::Extractor::Utils qw(:ALL); use Data::Dumper; print Dumper $Email::Extractor::Utils::Assets;"

load_addr_to_str

Accept a URI or a file path and return a string with its content

    my $text = load_addr_to_str($url);
    my $text = load_addr_to_str($path_to_file);

The function accepts either an http(s) URI or a file path

Dies if the file does not exist

Returns $resp->content even if the URL does not exist

If verbose mode is enabled, prints the time taken by the request

Can also be used in tests when you need to mock HTTP requests
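The dispatch described above can be sketched with core modules only. This is a minimal illustration, not the module's actual implementation: the name slurp_addr and the $verbose flag are hypothetical, and HTTP::Tiny is an assumed substitute for whatever HTTP client the module uses.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use Time::HiRes qw(time);

my $verbose = 0;    # hypothetical switch, standing in for the module's verbose mode

# Hypothetical helper: fetch an http(s) URL or read a local file
# and return the content as a single string.
sub slurp_addr {
    my ($addr) = @_;

    if ( $addr =~ m{\Ahttps?://}i ) {
        my $started = time;
        my $resp    = HTTP::Tiny->new( timeout => 10 )->get($addr);
        warn sprintf( "GET %s took %.3f s\n", $addr, time - $started ) if $verbose;
        return $resp->{content};    # returned as-is, even for failed requests
    }

    open my $fh, '<', $addr or die "Cannot open $addr: $!";
    local $/;                       # slurp mode: read the whole file at once
    my $content = <$fh>;
    close $fh;
    return $content;
}
```

Because a local path is handled by the same function, a test can pass a fixture file where production code would pass a URL.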

get_abs_path

Return the absolute path of a file, resolved relative to the current working directory

get_file_uri

Make an absolute path from a path relative to the cwd and return it as a file URI that passes Regexp::Common::URI::file validation

    get_file_uri('/test')   # 'file:///root/test' if cwd is /root
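A rough sketch of the same idea using File::Spec from the core distribution. The name to_file_uri is hypothetical, the sketch assumes a Unix-style filesystem, and it leaves absolute paths unchanged and does no percent-encoding, so its results may differ from the module's.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: resolve a path against the cwd and add the file scheme.
sub to_file_uri {
    my ($path) = @_;
    my $abs = File::Spec->rel2abs($path);    # absolute paths pass through unchanged
    return "file://$abs";                    # on Unix $abs starts with '/', giving file:///...
}

print to_file_uri('test'), "\n";             # output depends on the current directory
```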

looks_like_url

    looks_like_url('http://example.com')      # 1
    looks_like_url('https://example.com')      # 1
    looks_like_url('/root/somefolder')        # 0

Detect whether the link is an http or https URL

Uses Regexp::Common::URI::http

Return:

0 if the provided string is not a URL

the URL without its query part ($7 in Regexp::Common::URI::http, https://metacpan.org/pod/Regexp::Common::URI::http) if the provided string is a URL
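The return convention above (0, or the URL with its query removed) can be illustrated with a deliberately simplified pattern. The function url_without_query is hypothetical and its regex is far looser than Regexp::Common::URI::http, so treat it only as a sketch of the contract.

```perl
use strict;
use warnings;

# Hypothetical stand-in: a simplified pattern instead of Regexp::Common::URI::http.
sub url_without_query {
    my ($str) = @_;
    # Capture scheme, host and path; stop at whitespace, '?' or '#'.
    return 0 unless $str =~ m{\A(https?://[^\s?#]+)}i;
    return $1;
}

print url_without_query('https://example.com/a?b=1'), "\n";    # https://example.com/a
print url_without_query('/root/somefolder'), "\n";             # 0
```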

    Return true if the link looks like a relative URL, otherwise return false

looks_like_file

    looks_like_file('http://example.com')             # 0
    looks_like_file('file:///root/somefolder')        # 1

Detect whether the string is a file URI

Uses Regexp::Common::URI::file

absolutize_links

Make all links in the array absolute

    my $res = absolutize_links( $links, 'http://example.com' );

$links must be an ARRAYREF; the return value is also an ARRAYREF
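As a simplified illustration of the ARRAYREF-in, ARRAYREF-out shape, the sketch below resolves only root-relative links such as /about against the scheme and host of the base URL. The name absolutize_simple is hypothetical; a complete implementation would follow the reference-resolution rules of RFC 3986, which this sketch does not attempt.

```perl
use strict;
use warnings;

# Hypothetical, simplified helper: only root-relative links ("/path") are resolved.
sub absolutize_simple {
    my ( $links, $base ) = @_;

    # Keep only scheme and host of the base URL.
    my ($origin) = $base =~ m{\A(https?://[^/?#]+)}i
        or die "Base is not an http(s) URL: $base";

    my @out;
    for my $link (@$links) {
        # "/path" is root-relative; "//host" (protocol-relative) is left alone.
        push @out, $link =~ m{\A/(?!/)} ? "$origin$link" : $link;
    }
    return \@out;
}

my $res = absolutize_simple( [ '/about', 'http://other.org/x' ], 'http://example.com/dir/page' );
print "$_\n" for @$res;
```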

    my $res = absolutize_links( $links, 'http://example.com' );  # leave only links on http://example.com

Relative links stay untouched

$links must be an ARRAYREF; the return value is also an ARRAYREF

drop_asset_links

    my $res = drop_asset_links($links);

Leave only links that are not related to assets. Query params are removed as well

$links must be an ARRAYREF; the return value is also an ARRAYREF
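The filtering can be sketched as follows. The extension list and the name drop_assets are assumptions made for illustration; the module keeps its own list in $Email::Extractor::Utils::Assets, and its matching rules may differ.

```perl
use strict;
use warnings;

# Assumed extension list, for illustration only.
my @asset_ext = qw(css js png jpg jpeg gif ico svg);

# Hypothetical helper: strip query params, then drop links ending in an asset extension.
sub drop_assets {
    my ($links) = @_;
    my $ext_re = join '|', map { quotemeta($_) } @asset_ext;

    my @kept;
    for my $link (@$links) {
        ( my $clean = $link ) =~ s/\?.*\z//s;    # remove query params first
        next if $clean =~ /\.(?:$ext_re)\z/i;    # skip asset links
        push @kept, $clean;
    }
    return \@kept;
}

my $res = drop_assets( [ '/a.css?v=1', '/contact?x=1', '/img/logo.PNG', '/about' ] );
print "$_\n" for @$res;    # prints /contact and /about
```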

drop_anchor_links

    my $res = drop_anchor_links($links);

Leave only links that are not anchors to the same page (an anchor link looks like #rec31047364)

$links must be an ARRAYREF; the return value is also an ARRAYREF
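A minimal sketch, assuming that "anchor link" means a link that starts with #. The name drop_anchors is hypothetical, and links that merely contain a fragment after a path are kept.

```perl
use strict;
use warnings;

# Hypothetical helper: keep everything that does not start with '#'.
sub drop_anchors {
    my ($links) = @_;
    return [ grep { !/\A#/ } @$links ];
}

my $res = drop_anchors( [ '#rec31047364', '/about', 'http://example.com/#top' ] );
print "$_\n" for @$res;    # prints /about and http://example.com/#top
```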

remove_query_params

Remove GET query params from the provided array of links

    my $res = remove_query_params($links);

$links must be an ARRAYREF; the return value is also an ARRAYREF
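One way to express this is a non-destructive substitution over the list, sketched below. The name strip_query is hypothetical, and the sketch also drops any fragment that follows the query, which may not match the module's behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: return copies of the links with everything from '?' removed.
sub strip_query {
    my ($links) = @_;
    # The /r flag returns the modified copy and leaves the input array untouched.
    return [ map { s/\?.*\z//sr } @$links ];
}

my $links = [ 'http://example.com/a?x=1&y=2', '/contact' ];
my $res   = strip_query($links);
print "$_\n" for @$res;    # prints http://example.com/a and /contact
```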

Find all links and return the href attributes of the a tags

Returns an ARRAYREF

find_links_by_text

    find_links_by_text($html, $a_text, <$upper_lower_case_flag> )

Find all a tags containing a particular text and return their href values

If no search text is specified, return all links

Currently not used in the Email::Extractor project because it has unexpected behaviour (see tests)

Returns an ARRAYREF

TO-DO: try to implement this method with HTML::LinkExtor

isin($str, $arrayref)

    isin( $str, $arrayref )

Check whether $str is contained in $arrayref

Return true/false.
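A membership check of this kind can be sketched with grep. The name is_in is hypothetical, and string equality (eq) is an assumption about how values are compared.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if $str equals any element of $arrayref, else 0.
sub is_in {
    my ( $str, $arrayref ) = @_;
    return ( grep { $_ eq $str } @$arrayref ) ? 1 : 0;
}

print is_in( 'png', [qw(css js png)] ), "\n";    # 1
print is_in( 'pdf', [qw(css js png)] ), "\n";    # 0
```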

DESCRIPTION

Set of useful utilities that work with HTML and URLs

AUTHOR

Pavel Serikov <pavelsr@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018 by Pavel Serikov.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.