

pkgfind - Spiders given URL(s) downloading wanted files


pkgfind -d /tmp/mydir -n hello -u http://directory.fsf.org/hello.html -p 1 -w "hello-[\d\.]+\.tar\.gz"


pkgfind scans a web or FTP site for newly posted files and downloads them to the local filesystem. It then prints the names of the downloaded files to stdout, one result that can be piped straight into other tools.

Specify the search conditions on pkgfind's stdin, in this form:

 PACKAGE:     hotplug_cpu
 URL:         http://sr71.net/patches/
 DEPTH:       2
 WANTED:      patch-2\.6.*gz
 NOTWANTED:   patch-2\.6\.[89].*gz
 NOTWANTED:   patch-.*test\d*\.gz
 MIRRORS:     ufpr,jaist,superb-east,surfnet,mesh
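
As an illustration of how input in this form could be read, here is a minimal Perl sketch. It is not taken from the pkgfind source: the function name parse_definitions and the hash layout are assumptions made for this example. It treats each PACKAGE line as the start of a new definition and collects repeated WANTED and NOTWANTED lines into lists.

```perl
use strict;
use warnings;

# Keys that may appear more than once within one package definition.
my %multi = map { $_ => 1 } qw(WANTED NOTWANTED);

# Hypothetical helper: parse "KEY: value" lines from a filehandle into a
# list of hashes, one per PACKAGE line.
sub parse_definitions {
    my ($fh) = @_;
    my (@packages, $current);
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /^\s*([A-Z_]+):\s*(.*?)\s*$/;
        my ($key, $value) = ($1, $2);
        if ($key eq 'PACKAGE') {
            # A PACKAGE line starts a new definition.
            $current = { PACKAGE => $value, WANTED => [], NOTWANTED => [] };
            push @packages, $current;
        }
        elsif (!$current) {
            next;    # ignore parameters that precede the first PACKAGE line
        }
        elsif ($multi{$key}) {
            push @{ $current->{$key} }, $value;
        }
        else {
            $current->{$key} = $value;
        }
    }
    return @packages;
}

my $input = <<'END';
 PACKAGE:     hotplug_cpu
 URL:         http://sr71.net/patches/
 DEPTH:       2
 WANTED:      patch-2\.6.*gz
 NOTWANTED:   patch-2\.6\.[89].*gz
 NOTWANTED:   patch-.*test\d*\.gz
END

# Read from an in-memory filehandle here; a real run would read STDIN.
open my $fh, '<', \$input or die "cannot open in-memory input: $!";
my @packages = parse_definitions($fh);
close $fh;

for my $pkg (@packages) {
    # DEPTH falls back to the documented default of 5.
    printf "%s: %s (depth %d)\n",
        $pkg->{PACKAGE}, $pkg->{URL}, $pkg->{DEPTH} // 5;
}
```

Keeping WANTED and NOTWANTED as lists preserves every expression, while single-valued keys such as URL simply take the last value given.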

The parameters allowed are as follows:

PACKAGE - A tag name. A subdirectory will be created by this name and the files placed into it. This also indicates the start of a new package definition, if you wish to define multiple packages in your input.

URL - The website or ftp site to scan.

DEPTH - The number of subdirectories to descend during the search. If not specified, a depth of 5 is assumed.

WANTED - A Perl Regular Expression indicating filenames that should be accepted. Multiple WANTED expressions can be used; the file must match at least ONE of these.

NOTWANTED - A Perl Regular Expression indicating filenames to ignore. Multiple NOTWANTED expressions can be used; the file must not match ANY of these.

MIRRORS - A comma-separated list of mirror names. Causes MIRROR_URL, or URL if MIRROR_URL is not defined, to be modified to include a randomly selected mirror by substituting the strings "MIRROR" and "FILENAME". For example, to download a package from SourceForge, set URL to "http://prdownloads.sourceforge.net/crucible/FILENAME", set MIRROR_URL to "http://MIRROR.dl.sourceforge.net/sourceforge/crucible/", and set MIRRORS to "switch,kent,mesh,surfnet"; pkgfind will select one of those four mirrors at random and plug it into the URL.

MIRROR_URL - If both this and MIRRORS are defined, WWW::PkgFind downloads packages from a URL constructed by taking MIRROR_URL and substituting a randomly selected item from MIRRORS for the string "MIRROR".

RENAME - Allows specifying a regular expression to do a rename on the package after it is downloaded.
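
The mirror substitution described for MIRRORS and MIRROR_URL can be sketched as follows. This is an illustration only, not the WWW::PkgFind implementation: the helper name mirror_url and the file name crucible-1.0.tar.gz are hypothetical, and appending the file name when the template contains no "FILENAME" placeholder is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: build a download URL for $filename from a package
# definition, picking one of the listed mirrors at random.
sub mirror_url {
    my ($def, $filename) = @_;

    # Prefer MIRROR_URL, falling back to URL when it is not defined.
    my $url     = $def->{MIRROR_URL} // $def->{URL};
    my @mirrors = split /\s*,\s*/, ($def->{MIRRORS} // '');

    if (@mirrors) {
        my $mirror = $mirrors[ int rand @mirrors ];
        $url =~ s/MIRROR/$mirror/g;
    }

    # Assumption: a template without FILENAME gets the file name appended.
    $url =~ s/FILENAME/$filename/g or $url .= $filename;
    return $url;
}

my %def = (
    URL        => 'http://prdownloads.sourceforge.net/crucible/FILENAME',
    MIRROR_URL => 'http://MIRROR.dl.sourceforge.net/sourceforge/crucible/',
    MIRRORS    => 'switch,kent,mesh,surfnet',
);

# Prints a URL on one of the four mirrors; which one varies per run.
print mirror_url(\%def, 'crucible-1.0.tar.gz'), "\n";
```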

The motivation for this script is to poll places where developers post patches to software we're testing. The patches are then placed in a directory where other tools can pick them up and invoke regression tests on them, as appropriate.

This script was heavily derived from Judith Lebzelter's SourceSync module in the OSDL Patch Lifecycle Manager (PLM). Aside from breaking it out into a stand-alone tool, the other major change is the use of a plain text input file rather than a SQL database.


-V, --version

Prints the version and exits.

-h, --help

Prints a brief help message with option summary.


Prints a manual page (detailed help). Same as `perldoc pkgfind`.

-D N, --debug=N

Prints debug messages. This expects a numerical argument corresponding to the debug message verbosity.

-d path, --destination=path

The directory path to store the downloads. The current directory is assumed by default. Subdirectories will be created for the packages specified in the input file.


Allows specification of a user agent comment. The user agent appears in the logs of the site that package_retriever accesses as something like:

  package_retriever/1.00 <hostname> spider <descriptive-comment>

The descriptive comment is used to communicate the purpose of pulling packages from the site. For example, you may wish to provide a URL to the location where the results from testing the package are posted, or a page showing contact information about you.

-n <packagename>, --n=<packagename>

The name of the package.

-u <url>, --url=<url>

The URL to scan for packages.

-p <depth>, --depth=<depth>

The number of subdirectory levels to descend while searching for packages.

-w <wanted>, --wanted=<wanted>

A Perl regular expression matching the names of the files you want saved.

-W <notwanted>, --notwanted=<notwanted>

A Perl regular expression matching the names of files you do not want saved.
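
The accept/reject rule behind -w and -W (and the WANTED and NOTWANTED input parameters) can be sketched in a few lines of Perl. This is a minimal illustration, not the script's actual code: the function name is_wanted and the sample file names are hypothetical, and it assumes the expressions are applied unanchored to the bare file name.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $name matches at least one WANTED
# expression and none of the NOTWANTED expressions.
sub is_wanted {
    my ($name, $wanted, $notwanted) = @_;
    return 0 unless grep { $name =~ /$_/ } @$wanted;
    return 0 if     grep { $name =~ /$_/ } @$notwanted;
    return 1;
}

my @wanted    = ('patch-2\.6.*gz');
my @notwanted = ('patch-2\.6\.[89].*gz', 'patch-.*test\d*\.gz');

for my $file (qw(patch-2.6.12.gz patch-2.6.8.1.gz patch-2.6.12-test3.gz README)) {
    printf "%-24s %s\n", $file,
        is_wanted($file, \@wanted, \@notwanted) ? 'keep' : 'skip';
}
```

With these expressions only patch-2.6.12.gz is kept: the second name is rejected by the first NOTWANTED expression, the third by the second, and README matches no WANTED expression at all.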


This script requires the following Perl modules: Pod::Usage, Getopt::Long, LWP::Simple, File::Spec.




None known.


Bryce Harrington <bryce@osdl.org>



Copyright (C) 2005 Bryce Harrington, OSDL. Copyright (C) 2005 Jon Phillips. All Rights Reserved.

This script is free software; you can redistribute it and/or modify it under the same terms as Perl itself.


Revision: $Revision: 1.4 $