This is a documentation-only module that describes how to use simple_scan, and outlines some techniques you can use for some common Web testing problems.


simple_scan reads test specifications from standard input and generates Perl code based on these specifications. It can either

  • execute them immediately,

  • print them on standard output without executing them,

  • or do both: execute them and then print the generated code on standard output.


Test specifications describe

  • where the page is that you want to check,

  • some content (in the form of a Perl regular expression) that you want to look for,

  • a "success code", which defines whether or not the regex should match, and optionally allows you to run the test as a TODO, or skip it altogether.

    Y - The regular expression should match.
    N - The regular expression should not match.
    TY - This is a "TODO" test; eventually the regular expression should match, but it isn't expected to now. (If it does, simple_report will note that a test "UNEXPECTEDLY SUCCEEDED".)
    TN - This "TODO" test is expected to match now, but eventually should not. (If it fails to match, this is also an "unexpected success".)
    S - This test is to be skipped. Useful as a placeholder for a test that you don't currently want to run, typically one that is expensive or slow to run and that you know will not pass right now.
  • and a comment about why you care

simple_scan always uses an HTTP GET to access the URL; if you need to log in, or do any other setup that requires a different HTTP action, you'll need to use a plugin (see below).

Note that TODO tests get run whether or not they will pass; we just don't mind if they currently fail. Skipped tests are not run at all. Use skipped tests if you want to save time; use TODO tests if you want to be alerted to a change (from passing to failing or vice versa).
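The difference between TODO and skipped tests can be sketched with plain Test::More, assuming the generated code behaves like an ordinary Test::More-style script. The $page string and the test names below are made up for illustration; this is not code that simple_scan itself emits.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Stand-in for fetched page content (hypothetical).
my $page = 'Perl mentioned here';

# Y: the regex should match.
like($page, qr/Perl/, 'Y: regex should match');

TODO: {
    local $TODO = 'feature not released yet';
    # This test is executed and fails, but the failure is reported as an
    # expected TODO failure and does not fail the run.
    like($page, qr/Raku/, 'TY: should match eventually');
}

SKIP: {
    skip 'expensive test, known to fail', 1;
    # Never reached: a skipped test is not executed at all.
    like($page, qr/never evaluated/, 'S: not run');
}
```

If the TODO test ever starts passing, the harness flags it as an unexpected success, which is the alert described above.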

Some example test specs:

  /\d+ foobars found/ Y Check zorch query
  /Perl/              Y Perl mentioned here
  /Perl/              N Not mentioned here
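To make the parts of a test spec concrete, here is a minimal sketch that splits one spec line into its URL, regex, success code, and comment. The parse_spec helper and the example.com URL are hypothetical and only illustrate the layout; simple_scan's own parser may work differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of App::SimpleScan: split one spec line
# into URL, regex, success code, and comment.
sub parse_spec {
    my ($line) = @_;
    my ($url, $regex, $code, $comment) =
        $line =~ m{^(\S+)\s+/(.*)/\s+(TY|TN|Y|N|S)\s+(.*)\z}
        or return;
    return { url => $url, regex => $regex, code => $code, comment => $comment };
}

my $spec = parse_spec('http://example.com/ /Perl/ N Not mentioned here');
print "URL:     $spec->{url}\n";
print "Regex:   $spec->{regex}\n";
print "Code:    $spec->{code}\n";
print "Comment: $spec->{comment}\n";
```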


You can use pragmas to control how tests are executed. Pragmas start with '%%' at the beginning of the line, followed by a pragma name and arguments if the pragma takes any. simple_scan itself provides 3 pragmas:


agent

This tells simple_scan what User-Agent string to use. Because remembering all the fiddly bits is a pain, you can simply use shortcut names, like "Safari", "Mozilla", or "IE"; the actual list is the one supported by the agent_alias() method.


cache

Stacks a call to cache() in the tests built by simple_scan. This tells simple_scan to hang onto the last copy of the page fetched from every URL; if the URL is hit multiple times during a test, simple_scan fetches it only once and then reuses the cached copy for further tests.


nocache

Turns off caching by stacking a nocache() call in the tests built. simple_scan will always refetch every URL when nocache() is in force.

Here's a sample test using the base pragmas:

  %%agent Safari
  /download Safar/ N No Safari warning with Safari
  %%agent IE
  /download Safar/ Y Safari warning with IE
  /HTML5 Showcase/ Y uses cached copy
  %%nocache
  /takes a few/    Y uses a new copy
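The caching behavior described above can be sketched as a small per-URL memo. Everything here is an assumption made for illustration: the cache() and nocache() subs only mimic the names used in the text, fetch() is a stand-in that never touches the network, and none of this is simple_scan's actual implementation.

```perl
use strict;
use warnings;

my %cache;          # last copy of each page, keyed by URL
my $caching = 0;    # switched by the hypothetical cache()/nocache() below
my $fetches = 0;    # counts simulated network hits

sub cache   { $caching = 1 }
sub nocache { $caching = 0 }

# Stand-in for a real HTTP GET; returns fake content.
sub fetch {
    my ($url) = @_;
    $fetches++;
    return "content of $url (fetch #$fetches)";
}

# Return the cached copy when caching is on and one exists; otherwise
# fetch the page and remember it.
sub page_for {
    my ($url) = @_;
    return $cache{$url} if $caching && exists $cache{$url};
    return $cache{$url} = fetch($url);
}

cache();
page_for('http://example.com/') for 1 .. 3;   # one fetch, two cache hits
nocache();
page_for('http://example.com/');              # refetched
print "fetches: $fetches\n";                  # prints "fetches: 2"
```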


You can define substitutions much like you'd use a pragma:

  %%site apple
  %%subsite html5
  http://<site>.com/<subsite> /HTML5 Showcase/ Y Apple's HTML5 showcase

The twist with simple_scan variables is that they can have multiple values:

  %%query foo bar baz
  <query> /<query> found/ Y Found <query>

This causes simple_scan to generate code to run three tests, one for each of the values of the 'query' variable. Notice that we can substitute into any part of the test specification; in this case we didn't substitute into the test type, but it's as valid as any other part of the line.

If you have multiple variables with multiple values, simple_scan will generate the Cartesian product of them:

  %%foo one two three four
  %%bar alpha beta gamma delta epsilon
  %%baz now is the time for all good men
  <foo><bar><baz> /Found:/ Y Looking for <foo>, <bar>, <baz>

This generates 4 * 5 * 8 = 160 tests in just 4 lines.
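The expansion can be illustrated with a short sketch that substitutes each variable's values in turn. The expand function and the example.com URL are hypothetical; this shows only the Cartesian-product idea, not how simple_scan implements substitution.

```perl
use strict;
use warnings;

# Hypothetical sketch: replace every <name> in a template with each value
# of that variable, producing one line per combination of values.
sub expand {
    my ($template, %vars) = @_;
    my @lines = ($template);
    for my $name (sort keys %vars) {
        my @next;
        for my $line (@lines) {
            # Lines that don't mention this variable pass through unchanged.
            if ($line !~ /<\Q$name\E>/) {
                push @next, $line;
                next;
            }
            for my $value (@{ $vars{$name} }) {
                (my $copy = $line) =~ s/<\Q$name\E>/$value/g;
                push @next, $copy;
            }
        }
        @lines = @next;
    }
    return @lines;
}

my @tests = expand(
    'http://example.com/?q=<foo>+<bar>+<baz> /Found:/ Y Looking for <foo>, <bar>, <baz>',
    foo => [qw(one two three four)],
    bar => [qw(alpha beta gamma delta epsilon)],
    baz => [qw(now is the time for all good men)],
);
print scalar(@tests), " tests generated\n";   # 4 * 5 * 8 = 160
```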

Pragmas may expand into other pragmas; the previous example could have been written as

  %%foo one two three four
  %%bar alpha beta gamma delta epsilon
  %%baz now is the time for all good men
  %%query <foo><bar><baz>
  <query> ...

In this case, the 'query' variable would have been assigned all 160 values, and anything that used the 'query' variable would be expanded with all of them.

Caution is urged in creating complex nested expansions; making these too complicated can make your generated scripts very hard to debug, as there's currently no easy way to track the expansions and debug them.

Matching non-ASCII Latin-1 characters

First: be sure that the non-ASCII character you're seeing on the screen is actually present in the HTML source. You could be looking at an HTML entity that gets rendered as the character in question. For instance, a degree symbol may actually be &#xB0; in the source.

You can match a specific entity with its actual text:


(Note that we've made sure that it will work whether the hex "digits" are upper or lowercase.) Or you can match an arbitrary entity:


This one will also match things like &amp; and &brkbar; - with great power comes relative imprecision. There's a handy table of Latin-1 entities at

In some cases (e.g., Yahoo!'s search results), there will actually be non-Latin1 characters that are not HTML encoded. This is probably not good practice, but it still exists here and there. To deal with pages like this, copy and paste the exact text from a "view source" into the regex you want to use.

Newer versions of simple_scan handle data smoothly without any special action on your part, even if the encoding's off a bit.
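The entity-matching advice above can be tried out in plain Perl. The two patterns and the sample HTML string below are illustrative assumptions, one possible way to write such regexes; either pattern could then be used as the regex part of a test spec.

```perl
use strict;
use warnings;

# Sample page content (made up for illustration).
my $html = 'It was 72&#xb0; &amp; sunny';

# A specific entity: the degree sign, accepting either case for the hex digit.
my $degree = qr/&#x[Bb]0;/;

# An arbitrary entity, named or numeric. This is deliberately loose and
# matches &amp; just as readily as &#xB0;.
my $any_entity = qr/&#?\w+;/;

print "degree sign found\n" if $html =~ $degree;

my @entities = $html =~ /($any_entity)/g;
print "entities: @entities\n";
```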


Plugins are Perl modules that extend simple_scan's abilities without modification of the core code.

Installing a new pragma

Create a pragmas method in your plugin that returns pairs of pragma names and methods to be called to process the pragma.

  sub pragmas {
    return (['mypragma' => \&do_my_pragma],
            ['another'  => \&another]);
  }

  sub do_my_pragma {
    my ($app, $args) = @_;
    # Parse the arguments. You have access to
    # all of the methods in App::SimpleScan as
    # well as any subs defined here. You may
    # want to export methods to the App::SimpleScan
    # namespace in your import() method.
  }


Installing new command-line options

Create an options method in your plugin that returns a hash of options and variables to capture their values in. You will also want to export accessors for these variables to the App::SimpleScan namespace in your import.

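  my $myoption;   # holds the option's value

  sub import {
    no strict 'refs';
    # Export the accessor into the calling package.
    *{caller() . '::myoption'} = \&myoption;
  }

  sub options {
    return ('myoption' => \$myoption);
  }

  sub myoption {
    my ($self, $value) = @_;
    $myoption = $value if defined $value;
    return $myoption;
  }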

Installing other modules via plugins

Create a test_modules method that returns a list of module names to be used by the generated test program.

  sub test_modules {
    return ('Test::Foo', 'Blortch::Zonk');
  }

Adding extra code to the test output stack in a plugin

Create a per_test subroutine. This method gets called with the current App::SimpleScan::TestSpec object.

  sub per_test {
    $self->app->_stack_test(qq(fail "forced failure accessing";\n))
     if $self->uri =~ /;
  }

Altering code/inserting code for every test stacked

Create a filter subroutine. This will get called with an array of strings corresponding to the code that's about to be stacked; you can do whatever additions or alterations you like. Just return your altered code as an array of strings; if you've added any tests to it, use the test_count() method in the app() object to up the test count appropriately.
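A minimal sketch of such a filter follows. The package name My::Plugin::Annotate is hypothetical, and the calling convention (a method call with the code strings as the remaining arguments) is an assumption rather than something confirmed here. This filter only prepends a comment, so it adds no tests and leaves the test count alone.

```perl
use strict;
use warnings;

package My::Plugin::Annotate;   # hypothetical plugin name

# Assumed calling convention: invoked as a method, with the strings of
# code about to be stacked as the remaining arguments.
sub filter {
    my ($self, @code) = @_;
    # Prepend a comment line; no tests are added, so the test count
    # needs no adjustment.
    return ("# checked by My::Plugin::Annotate\n", @code);
}

package main;

# Simulate the call with one line of made-up generated code.
my @out = My::Plugin::Annotate->filter(qq{ok(1, "demo");\n});
print @out;
```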

Current plugins available

Currently, there are six simple_scan plugins available on CPAN:

Cache - the cache plugin extends simple_scan's caching to actually store the cached pages on disk. This allows subsequent runs to (if they choose) reuse pages that were fetched by previous runs. This is most useful in situations where you want to explore a number of different tests on a page, and you want to minimize the impact of your fetching the page to test it.
Forget - lets you drop a substitution from the substitutions you've defined during a test run.
Plaintext - lets you match against the plain text of a page, with the markup removed. This allows you to check content without worrying about whether or how it has been marked up.
Retry - lets you automatically retry a URL up to a set number of times before giving up. (It also adds a --retry command-line option which does the same.)
Snapshot - lets you automatically save a snapshot of the page as it was when the GET request was made, either for every GET or only when there are errors. This can be very useful in visually debugging problems with simple_scan tests. See App::SimpleScan::Plugin::Snapshot for detailed usage information.