package HTML::Untemplate;
# ABSTRACT: web scraping assistant
# PODNAME: HTML::Untemplate

use strict;
use utf8;
use warnings qw(all);

use Moo;
extends 'HTML::Linear';

our $VERSION = '0.019'; # VERSION


1;

__END__

=pod

=encoding UTF-8

=head1 NAME

HTML::Untemplate - web scraping assistant

=head1 VERSION

version 0.019

=head1 DESCRIPTION

Suppose you have a set of HTML documents generated by populating the same template with the data from some kind of database.
L<HTML::Untemplate> is a set of command-line tools
(L</xpathify>, L</untemplate>)
and modules (L<HTML::Linear> and its dependencies)
which assist in original data retrieval.

This process is also known as L<wrapper induction|https://en.wikipedia.org/wiki/Wrapper_(data_mining)>.

To achieve this goal, HTML tree nodes are presented as XPath/content pairs.
HTML documents linearized this way can be easily inspected manually or with a L<diff> tool.
Please refer to L</EXAMPLES>.

Despite being named similarly to L<HTML::Template>, this distribution is not directly related to it.
Instead, it attempts to reverse the templating action, whatever templating agent was used.

=head2 Why?

Suppose you have a CMS.
A typical CMS works roughly like this (data flows top-down):

            RDBMS
      scripting language
             HTML
         HTTP server
            (...)
          HTTP agent
        layout engine
            screen
             user

Consider the first 3 steps: C<RDBMS =E<gt> scripting language =E<gt> HTML>

This is "applying template".

Now, consider this: C<HTML =E<gt> scripting language =E<gt> RDBMS>

I would call that "un-applying template", or "untemplate" C<:)>

The practical application of this set of tools is to assist in the creation of web scrapers.

A similar (however completely unrelated) approach is described in the paper L<XPath-Wrapper Induction for Data Extraction|http://www.coltech.vnu.edu.vn/~thuyhq/papers/10_Khanh_Cuong_thuy_4288a150.pdf>.

=head2 Human-readability

Consider the following HTML node address representations:

=over 4

=item *

C<0.1.3.0.0.4.0.0.0.2> (L<HTML::TreeBuilder> internal address representation);

=item *

C</html/body/div[4]/div/div[1]/table[2]/tr/td/ul/li[3]> (L<HTML::Linear>, strict);

=item *

C<//td[1]/ul[1]/li[3]> (L<HTML::Linear>, strict, shrink);

=item *

C</html/body[@class='section_home']/div[@id='content_holder'][1]/div[@id='content']/div[@id='main']/table[@class='content_table'][2]/tr/td/ul/li[@class='rss_content rss_content_col'][2]> (L<HTML::Linear>, non-strict);

=item *

C<//li[@class='rss_content rss_content_col'][2]> (L<HTML::Linear>, non-strict, shrink).

=back

They all point to the same node; however, their verbosity and readability vary.
The I<strict> mode specifies tag names and positions only.
Disabling I<strict> adds the attribute data commonly used by CSS selectors (such as C<id> and C<class>).
I<Shrink> mode attempts to find the shortest XPath unique for every node
(C</html/body> is shared among almost all nodes, thus is likely to be irrelevant).
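
The idea behind I<shrink> can be sketched in plain Perl.
This is only an illustration, not the algorithm L<HTML::Linear> actually uses:
the C<shrink_paths> helper is hypothetical, it assumes that no step contains a C</>,
and it only guarantees uniqueness among the paths it is given.

    use strict;
    use warnings;

    # Hypothetical helper: for each absolute path, find the shortest
    # trailing run of steps that no other path in the list ends with.
    sub shrink_paths {
        my @paths = @_;
        my %count;

        # Count how many paths end with each possible suffix.
        for my $path (@paths) {
            my @steps = grep { length } split m{/}, $path;
            $count{ join '/', @steps[ $_ .. $#steps ] }++ for 0 .. $#steps;
        }

        my %short;
        for my $path (@paths) {
            my @steps = grep { length } split m{/}, $path;
            $short{$path} = $path;    # fall back to the full path
            for my $i ( reverse 0 .. $#steps ) {
                my $suffix = join '/', @steps[ $i .. $#steps ];
                next if $count{$suffix} > 1;
                $short{$path} = $i ? "//$suffix" : "/$suffix";
                last;
            }
        }
        return \%short;
    }

    my $short = shrink_paths(
        '/html/head/title',
        '/html/body/h1',
        '/html/body/p[1]',
        '/html/body/p[2]',
    );
    print "$_ => $short->{$_}\n" for sort keys %$short;

Since C</html/body> prefixes three of the four paths, it carries no distinguishing information and is dropped from each shortened form.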

=head1 EXAMPLES

=head2 xpathify

The L<xpathify> tool flattens the HTML tree into a key/value list:

    <!DOCTYPE html>
    <html>
        <head>
            <title>Hello HTML</title>
        </head>
        <body>
            <h1>Hello World!</h1>
            <p>This is a sample HTML</p>
            Beware!
            <p>HTML is <b>not</b> XML!</p>
            Have a nice day.
        </body>
    </html>

Becomes:

I<(HTML block)>

=for html     <table summary="">
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">head</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">title</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>Hello&nbsp;HTM</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>&quot;title&quot;</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td><td>#title</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">h1</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>Hello&nbsp;World!</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>This&nbsp;is&nbsp;a&nbsp;sample&nbsp;HTM</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>&quot;p&quot;</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td><td>#p</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td><u style="background-color:#f00;color:#f00">&nbsp;</u>Beware!<u style="background-color:#f00;color:#f00">&nbsp;</u></td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>HTML&nbsp;is<u style="background-color:#f00;color:#f00">&nbsp;</u></td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">b</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td>not</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td><u style="background-color:#f00;color:#f00">&nbsp;</u>XML!</td></tr>
    <tr><td><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td><td><u style="background-color:#f00;color:#f00">&nbsp;</u>Have&nbsp;a&nbsp;nice&nbsp;day.<u style="background-color:#f00;color:#f00">&nbsp;</u></td></tr>
    </table>

The keys are in XPath format, while the values are the respective content from the HTML tree.
Theoretically, it could be possible to reassemble the HTML tree from the flat key/value list this tool generates.
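
A rough sketch of the reverse direction, in plain Perl, might look like this.
It is not part of this distribution: C<unflatten> is a hypothetical helper that folds ordered XPath/value pairs into a nested hash.
Keys can repeat in the flat list, so values are collected into arrays.
The interleaving of text and child elements is not kept in this simplified form; a real reassembly would also have to honour the order of the list.

    use strict;
    use warnings;

    # Hypothetical helper: fold ordered [xpath, value] pairs into a nested hash.
    sub unflatten {
        my @pairs = @_;
        my %tree;
        for my $pair (@pairs) {
            my ( $xpath, $value ) = @$pair;
            my @steps = grep { length } split m{/}, $xpath;
            my $leaf  = pop @steps;    # text() or @attribute
            my $node  = \%tree;
            $node = $node->{$_} //= {} for @steps;    # descend, creating levels
            push @{ $node->{$leaf} }, $value;         # keep every repeated value
        }
        return \%tree;
    }

    my $tree = unflatten(
        [ '/html/head[1]/title[1]/text()',  'Hello HTML' ],
        [ '/html/body[1]/p[2]/text()',      'HTML is ' ],
        [ '/html/body[1]/p[2]/b[1]/text()', 'not' ],
        [ '/html/body[1]/p[2]/text()',      ' XML!' ],
    );

    # Both text fragments of the second paragraph end up under the same key.
    print join( '|', @{ $tree->{html}{'body[1]'}{'p[2]'}{'text()'} } ), "\n";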

=head2 untemplate

The L<untemplate> tool flattens a set of HTML documents using the algorithm from L<xpathify>.
Then, it strips the shared key/value pairs.
The "rest" is composed of the original values fed into the template engine.
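
The stripping step can be sketched as follows.
This is a simplified illustration rather than the code of L<untemplate>:
C<strip_shared> is a hypothetical helper, each document is assumed to be already flattened into a hash with unique XPath keys,
and the C<h1> pair is made up to stand in for template text.

    use strict;
    use warnings;

    # Hypothetical helper: $docs maps a file name to its { xpath => value } hash.
    # Returns only the XPaths whose value is not identical in every document.
    sub strip_shared {
        my ($docs) = @_;
        my @names = sort keys %$docs;

        # Collect every XPath seen in any document.
        my %xpaths;
        for my $name (@names) {
            $xpaths{$_} = 1 for keys %{ $docs->{$name} };
        }

        my %diff;
        for my $xpath ( sort keys %xpaths ) {
            # Distinct values across documents; a missing key counts as a value.
            my %distinct;
            for my $name (@names) {
                my $value = $docs->{$name}{$xpath};
                $distinct{ defined $value ? "=$value" : 'missing' } = 1;
            }
            next if keys %distinct == 1;    # same everywhere: template text

            for my $name (@names) {
                $diff{$xpath}{$name} = $docs->{$name}{$xpath}
                    if exists $docs->{$name}{$xpath};
            }
        }
        return \%diff;
    }

    my $diff = strip_shared({
        '2486.html' => {
            '/html/head[1]/title[1]/text()' => 'QDB: Quote #2486',
            '/html/body[1]/h1[1]/text()'    => 'Shared heading',
        },
        '1839.html' => {
            '/html/head[1]/title[1]/text()' => 'QDB: Quote #1839',
            '/html/body[1]/h1[1]/text()'    => 'Shared heading',
        },
    });

    for my $xpath ( sort keys %$diff ) {
        print "$xpath\n";
        print "  $_: $diff->{$xpath}{$_}\n" for sort keys %{ $diff->{$xpath} };
    }

Only the title survives, because its value differs between the two documents; the shared heading is treated as part of the template and dropped.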

This is what the result actually looks like with some simple real-world examples
(quotes L<1839|http://bash.org/?1839> and L<2486|http://bash.org/?2486> from L<bash.org|http://bash.org/>):

I<(HTML block)>

=for html     <table summary="">
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">head</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">title</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>QDB:&nbsp;Quote&nbsp;#2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>QDB:&nbsp;Quote&nbsp;#1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">html</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">body</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">form</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">center</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">table</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">tr</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">td</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">font</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span 
style="font-weight:bold;color:#880">b</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>Quote&nbsp;#2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>Quote&nbsp;#1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;quote&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>?2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>?1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;quote&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">b</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>#2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>#1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;qa&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;rox=2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;rox=1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;quote&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>(228)</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>(245)</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;qa&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;sox=2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;sox=1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">a</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;qa&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">3</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">href</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;sux=2486</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>./?le=cc8456a913b26eb7364e4e9a94348d04&amp;sux=1839</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">p</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;qt&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>&lt;R`:#heroin&gt;&nbsp;Is&nbsp;this&nbsp;for&nbsp;recovery&nbsp;or&nbsp;indulgence?</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>&lt;maff&gt;&nbsp;who&nbsp;needs&nbsp;showers&nbsp;when&nbsp;you&#39;ve&nbsp;got&nbsp;an&nbsp;assortment&nbsp;of&nbsp;feminine&nbsp;products</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    <tr><td colspan="2"><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">tr</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">2</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#880">td</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#080">@</span><span style="font-weight:bold;color:#008">class</span><span style="font-weight:bold;color:#88f">=</span><span style="font-weight:bold;color:#808">&#39;footertext&#39;</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#800">[</span><span style="font-weight:bold;color:#848">1</span><span style="font-weight:bold;color:#800">]</span><span style="font-weight:bold;color:#088">/</span><span style="font-weight:bold;color:#008">text()</span></td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">2486.html</span></td><td>0.0035</td></tr>
    <tr><td><span style="font-weight:bold;color:#f8f">1839.html</span></td><td>0.0033</td></tr>
    <tr><td colspan="2" style="height:1em"></td></tr>
    </table>

=head1 MODULES

These modules may be used to serialize/flatten HTML documents on your own:

=over 4

=item *

L<HTML::Linear> - represent L<HTML::Tree> as a flat list

=item *

L<HTML::Linear::Element> - represent elements to populate L<HTML::Linear>

=item *

L<HTML::Linear::Path> - represent paths inside L<HTML::Tree>

=back

=head1 REFERENCES

=over 4

=item *

L<Wrapper (data mining)|https://en.wikipedia.org/wiki/Wrapper_(data_mining)>

=item *

L<XPath-Wrapper Induction for Data Extraction|http://www.coltech.vnu.edu.vn/~thuyhq/papers/10_Khanh_Cuong_thuy_4288a150.pdf>

=item *

L<Extracting Data from HTML Using TreeBuilder Node IDs|http://secondthought.org/notes/nodeIdProbabilities.html>

=item *

L<Web Scraping Made Simple with SiteScraper|http://sitescraper.googlecode.com/files/sitescraper.pdf>

=back

=head1 SEE ALSO

=over 4

=item *

L<HTML::Similarity>

=item *

L<Template::Extract>

=item *

L<XML::DifferenceMarkup>

=item *

L<XML::XSH2>

=back

=head1 AUTHOR

Stanislaw Pusep <stas@sysd.org>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Stanislaw Pusep.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut