HTML::WikiConverter - An HTML to wiki markup converter


  use HTML::WikiConverter;
  my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
  print $wc->html2wiki( $html );


HTML::WikiConverter is an HTML to wiki converter. It can convert HTML source into a variety of wiki markups, called wiki "dialects".


  my $wc = HTML::WikiConverter->new( dialect => $dialect, %attrs );

Returns a converter for the specified dialect. Dies if $dialect is not provided or is not installed on your system. (See "Supported dialects" for a list of supported dialects.) Additional parameters are optional and can be included in %attrs:

  base_uri
    URI to use for converting relative URIs to absolute ones

  wiki_uri
    URI used in determining which links are wiki links. For example,
    the English Wikipedia would use 'http://en.wikipedia.org/wiki/'

  wrap_in_html
    Helps HTML::TreeBuilder parse HTML fragments by wrapping HTML
    in <html> and </html> before passing it through html2wiki()

  my $wiki = $wc->html2wiki( $html );

Converts the HTML source into wiki markup for the current dialect.

  my $html = $wc->parsed_html;

Returns the HTML representation of the last-parsed syntax tree. Use this to see how your input HTML was parsed internally, which is often useful for debugging.

  my $base_uri = $wc->base_uri;
  $wc->base_uri( $new_base_uri );

Gets or sets the base_uri option used for converting relative URIs to absolute ones.

  my $wiki_uri = $wc->wiki_uri;
  $wc->wiki_uri( $new_wiki_uri );

Gets or sets the wiki_uri option used for determining which links are links to wiki pages.

  my $wrap_in_html = $wc->wrap_in_html;
  $wc->wrap_in_html( $new_wrap_in_html );

Gets or sets the wrap_in_html option used to help HTML::TreeBuilder parse (broken) fragments of HTML that aren't contained within a parent element. For example, the following HTML fragment causes trouble:

  Hello<br> goodbye.

This is parsed by HTML::TreeBuilder as:

      <p><~text text="Hello"></~text><br>

Note that the string " goodbye" is missing. This can be resolved by wrapping the HTML fragment in a parent element. In many cases a <p> tag is appropriate, but in the general case <html> is preferred: it has no meaning to wiki dialects and therefore has very little chance of interfering with HTML-to-wiki conversion.


These methods are for use only by dialect modules.

  my $wiki = $wc->get_elem_contents( $node );

Converts the contents of $node (i.e. its children) into wiki markup and returns the resulting wiki markup.

  my $title = $wc->get_wiki_page( $url );

Attempts to extract the title of a wiki page from the given URL, returning the title on success and undef on failure. If wiki_uri is empty, this method always returns undef. Assumes that URLs to wiki pages are constructed using <wiki-uri><page-name>.
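
The <wiki-uri><page-name> assumption can be pictured with a small standalone function. This is only a sketch of the documented behaviour, not the module's actual code; the name extract_wiki_page is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::WikiConverter: returns the page
# title if $url starts with $wiki_uri, and undef otherwise.
sub extract_wiki_page {
    my ( $wiki_uri, $url ) = @_;

    # An empty or missing wiki_uri can never match.
    return unless defined $wiki_uri && length $wiki_uri;
    return unless defined $url;

    # \Q...\E quotes regex metacharacters in the URI (dots, slashes, etc.).
    return $1 if $url =~ /^\Q$wiki_uri\E(.+)$/;
    return;
}

my $wiki_uri = 'http://en.wikipedia.org/wiki/';
my $title    = extract_wiki_page( $wiki_uri, 'http://en.wikipedia.org/wiki/Perl' );
print "$title\n";    # prints "Perl"
```

A URL outside the wiki, or an empty wiki_uri, yields undef in this sketch, matching the description above.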

  my $ok = $wc->is_camel_case( $str );

Returns true if $str is in CamelCase, false otherwise. CamelCase-ness is determined using the same rules that CGI::Kwiki's formatting module uses.

  my $attr_str = $wc->get_attr_str( $node, @attrs );

Returns a string containing the specified attributes in the given node. The returned string is suitable for insertion into an HTML tag. For example, if $node refers to the HTML

  <span id="ht" class="head" onclick="editPage()">Header</span>

and @attrs contains "id" and "class", then get_attr_str will return 'id="ht" class="head"'.
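
To make the selection behaviour concrete, here is a minimal sketch that works on a plain hash of attributes instead of an HTML::Element node. The function name attr_str is hypothetical, and the sketch does no escaping of attribute values, which a real implementation would likely need.

```perl
use strict;
use warnings;

# Hypothetical stand-in for get_attr_str that reads attributes from a plain
# hash rather than from an HTML::Element node.
sub attr_str {
    my ( $attr_of, @names ) = @_;

    # Keep the order given in @names and skip attributes the element lacks.
    return join ' ',
        map  { sprintf '%s="%s"', $_, $attr_of->{$_} }
        grep { defined $attr_of->{$_} } @names;
}

my %span = ( id => 'ht', class => 'head', onclick => 'editPage()' );
print attr_str( \%span, qw/ id class / ), "\n";    # id="ht" class="head"
```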


HTML::WikiConverter can convert HTML into markup for a variety of wiki engines. The markup used by a particular engine is called a wiki markup dialect. Support is added for dialects by installing dialect modules which provide the rules for how HTML is converted into that dialect's wiki markup.

Dialect modules are registered in the HTML::WikiConverter:: namespace and are usually given names in CamelCase. For example, the rules for the MediaWiki dialect are provided in HTML::WikiConverter::MediaWiki, and those for PhpWiki in HTML::WikiConverter::PhpWiki.

Supported dialects

HTML::WikiConverter supports conversions for the following dialects:


While under most conditions each will produce satisfactory wiki markup, the complete syntactic sugar of each dialect has not yet been implemented. Suggestions, especially in the form of patches, are very welcome.

Of these, the MediaWiki dialect is probably the most complete. I am a Wikipediholic, after all. :-)

Conversion rules

To interface with HTML::WikiConverter, dialect modules must define a single rules class method. It returns a reference to a hash of rules that specify how individual HTML elements are converted to wiki markup. The following rules are recognized:





For example, the following rules method could be used for a wiki dialect that uses *asterisks* for bold and _underscores_ for italic text:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      i => { start => '_', end => '_' },
    };
  }

To add <strong> and <em> as aliases of <b> and <i>, use the 'alias' rule:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      strong => { alias => 'b' },

      i => { start => '_', end => '_' },
      em => { alias => 'i' },
    };
  }

(If you specify the 'alias' rule, no other rules are allowed.)
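
One way to picture how 'alias', 'start', and 'end' fit together is the standalone sketch below. The helpers rules_for_tag and wrap are hypothetical and are not how HTML::WikiConverter is necessarily implemented; they only show an alias being followed to the rules it points at.

```perl
use strict;
use warnings;

my %rules = (
    b      => { start => '*', end => '*' },
    strong => { alias => 'b' },
    i      => { start => '_', end => '_' },
    em     => { alias => 'i' },
);

# Hypothetical lookup: follow 'alias' once to reach the real rules.
sub rules_for_tag {
    my ( $rules, $tag ) = @_;
    my $rule = $rules->{$tag} or return;
    return exists $rule->{alias} ? $rules->{ $rule->{alias} } : $rule;
}

# Wrap already-converted content in a tag's start and end strings.
# Tags without rules pass their content through unchanged.
sub wrap {
    my ( $tag, $content ) = @_;
    my $rule = rules_for_tag( \%rules, $tag ) or return $content;
    return ( $rule->{start} // '' ) . $content . ( $rule->{end} // '' );
}

print wrap( 'strong', 'bold' ), ' ', wrap( 'em', 'italic' ), "\n";    # *bold* _italic_
```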

Many wiki dialects separate paragraphs and other block-level elements with a blank line. To indicate this, use the 'block' keyword:

  p => { block => 1 }

(Note that if a block-level element is nested inside another block-level element, blank lines are only added to the outermost block-level element.)

However, many such wiki engines require that the text of a paragraph be contained on a single line, or that a paragraph not contain any blank lines. These formatting options can be specified using the 'line_format' keyword, which can be assigned the value 'single', 'multi', or 'blocks'.

If the element must be contained on a single line, then the 'line_format' option should be 'single'. If the element can span multiple lines, but there can be no blank lines contained within, then it should be 'multi'. If blank lines (which delimit blocks) are allowed, then it should be 'blocks'. For example, paragraphs are specified like so in the MediaWiki dialect:

  p => { block => 1, line_format => 'multi', trim => 1 }

The 'trim' option indicates that leading and trailing whitespace should be stripped from the paragraph before other rules are processed. You can use 'trim_leading' and 'trim_trailing' if you only want whitespace trimmed from one end of the content.
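
The effect of 'trim' and the three 'line_format' values can be approximated in a few lines of plain Perl. The function format_content and its regular expressions are assumptions made for illustration; the module's own whitespace handling may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical formatting step applied to one element's converted text.
sub format_content {
    my ( $text, %opt ) = @_;

    # 'trim' strips whitespace from both ends before anything else.
    $text =~ s/^\s+|\s+$//g if $opt{trim};

    my $format = $opt{line_format} // 'blocks';
    if ( $format eq 'single' ) {
        $text =~ s/\s*\n\s*/ /g;    # everything on one line
    }
    elsif ( $format eq 'multi' ) {
        $text =~ s/\n\s*\n/\n/g;    # line breaks kept, blank lines removed
    }
    # 'blocks': blank lines are allowed, so the text is left alone.

    return $text;
}

my $para = "  first\n\nsecond\nthird  ";
print format_content( $para, trim => 1, line_format => 'multi' ), "\n";
```

With 'multi' the sample keeps its three lines but loses the blank one; with 'single' it collapses to "first second third".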

Some multi-line elements require that each line of output be prefixed with a particular string. For example, preformatted text in the MediaWiki dialect is prefixed with one or more spaces. This is specified using the 'line_prefix' option:

  pre => { block => 1, line_prefix => ' ' }
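
As an illustration of what 'line_prefix' implies, the sketch below prepends a string to every line of some text. The helper prefix_lines is hypothetical and only mirrors the behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical helper: prepend $prefix to every line of $text.
sub prefix_lines {
    my ( $text, $prefix ) = @_;

    # With /m, ^ matches at the start of each line, not just of the string.
    ( my $out = $text ) =~ s/^/$prefix/gm;
    return $out;
}

# Each line of the preformatted block gains a leading space.
print prefix_lines( "first line\nsecond line", ' ' ), "\n";
```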

In some cases, conversion from HTML to wiki markup is as simple as string replacement. When you want to replace a tag and its contents with a particular string, use the 'replace' option. For example, in the PhpWiki dialect, three percent signs '%%%' represent a line break <br>, hence the rule:

  br => { replace => '%%%' }

(If you specify the 'replace' option, no other options are allowed.)

Finally, many wiki dialects allow a subset of HTML in their markup, such as for superscripts, subscripts, and text centering. HTML tags may be preserved using the 'preserve' option. For example, to allow the <font> tag in wiki markup, one might say:

  font => { preserve => 1 }

(The 'preserve' rule cannot be combined with the 'start' or 'end' rules.)

Preserved tags may also specify a whitelist of attributes that may pass through from HTML to wiki markup. This is done with the 'attributes' option:

  font => { preserve => 1, attributes => [ qw/ font size / ] }

(The 'attributes' rule must be used in conjunction with the 'preserve' rule.)

Some HTML elements have no content (e.g. line breaks), and should be preserved specially. To indicate that a preserved tag should have no content, use the 'empty' rule. This will cause the element to be replaced with "<tag />", with no end tag and any attributes you specified. For example, the MediaWiki dialect handles line breaks like so:

  br => {
    preserve => 1,
    attributes => [ qw/ id class title style clear / ],
    empty => 1
  }

This will convert, e.g., "<br clear='both'>" into "<br clear='both' />". Without specifying the 'empty' rule, this would be converted into the undesirable "<br clear='both'></br>".

(The 'empty' rule requires that 'preserve' is also specified.)

Dynamic rules

Instead of simple strings, you may use coderefs as option values for the 'start', 'end', 'replace', and 'line_prefix' rules. If you do, the code will be called with three arguments: 1) the current HTML::WikiConverter instance, 2) the current HTML::Element node, and 3) the rules for that node (as a hashref).

Specifying rules dynamically is often useful for handling nested elements. For example, the MoinMoin dialect uses the following rules for lists:

  ul => { line_format => 'multi', block => 1, line_prefix => '  ' },
  li => { start => \&_li_start, trim_leading => 1 },
  ol => { alias => 'ul' },

It then defines _li_start like so:

  sub _li_start {
    my( $wc, $node, $rules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
  }

This ensures that every unordered list item is prefixed with '*' and every ordered list item is prefixed with '1.', per the MoinMoin markup. It also ensures that each list item is on a separate line and that there is a space between the prefix and the content of the list item.
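
The string-or-coderef convention can be sketched without the module itself. In the example below, resolve_rule_value is a hypothetical helper and the node is a plain hash with a made-up parent_tag key standing in for an HTML::Element; only the three-argument calling convention is taken from the description above.

```perl
use strict;
use warnings;

# Hypothetical resolver: a rule value is either a plain string or a coderef
# that is called with the converter, the node, and the node's rules.
sub resolve_rule_value {
    my ( $value, $wc, $node, $rules ) = @_;
    return ref($value) eq 'CODE' ? $value->( $wc, $node, $rules ) : $value;
}

my %li_rules = (
    start => sub {
        my ( $wc, $node, $rules ) = @_;
        return $node->{parent_tag} eq 'ol' ? "\n1. " : "\n* ";
    },
    trim_leading => 1,
);

# A plain hash stands in for an HTML::Element node in this sketch.
my $node = { parent_tag => 'ol' };
print resolve_rule_value( $li_rules{start}, undef, $node, \%li_rules ), "item\n";
```

Plain strings pass through the resolver untouched, so static and dynamic rules can share one code path.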

Rule validation

Certain rule combinations are not allowed. For example, the 'replace' and 'alias' rules cannot be combined with any other rules, and 'attributes' can only be specified alongside 'preserve'. Invalid rule combinations will trigger an error when the dialect module is loaded.
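
The restrictions mentioned in this document can be collected into a small checker. This is a sketch only: rule_errors is a hypothetical function, it covers just the combinations described here, and it returns messages instead of raising an error as the module does.

```perl
use strict;
use warnings;

# Hypothetical validator covering only the restrictions named in this document.
sub rule_errors {
    my ( $tag, $rule ) = @_;
    my @errors;
    my $count = keys %$rule;

    # 'alias' and 'replace' must stand alone.
    for my $exclusive (qw/ alias replace /) {
        push @errors, "$tag: '$exclusive' cannot be combined with other rules"
            if exists $rule->{$exclusive} && $count > 1;
    }

    # 'attributes' and 'empty' only make sense on preserved tags.
    for my $dependent (qw/ attributes empty /) {
        push @errors, "$tag: '$dependent' requires 'preserve'"
            if exists $rule->{$dependent} && !$rule->{preserve};
    }

    # 'preserve' excludes 'start' and 'end'.
    if ( $rule->{preserve} ) {
        for my $excluded ( grep { exists $rule->{$_} } qw/ start end / ) {
            push @errors, "$tag: 'preserve' cannot be combined with '$excluded'";
        }
    }

    return @errors;
}

my %rules = (
    br   => { replace    => '%%%' },                  # valid
    em   => { alias      => 'i', start => '_' },      # invalid: alias plus start
    font => { attributes => [ qw/ size / ] },         # invalid: no preserve
);
print "$_\n" for map { rule_errors( $_, $rules{$_} ) } sort keys %rules;
```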


The first step in converting HTML source to wiki markup is to parse the HTML into a syntax tree using HTML::TreeBuilder. It is often useful for dialects to preprocess the tree prior to converting it into wiki markup. Dialects that elect to preprocess the tree do so by defining a preprocess_node class method, which will be called on each node of the tree (traversal is done in pre-order). The method receives three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) the current HTML::Element node being traversed. It may modify the node or decide to ignore it. The return value of the preprocess_node method is not used.

Because they are so commonly needed, two preprocessing steps are automatically carried out by HTML::WikiConverter, regardless of the dialect: 1) relative URIs in images and links are converted to absolute URIs (based upon the 'base_uri' parameter), and 2) ignorable text (e.g. between </td> and <td>) is discarded.
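
A preprocess_node method and the pre-order traversal that drives it might look like the sketch below. Mock::Node is a minimal stand-in for HTML::Element so that the example runs on its own; My::Dialect and preprocess_tree are likewise hypothetical names. Only the argument order and the pre-order visit are taken from the description above.

```perl
use strict;
use warnings;

# Minimal stand-in for HTML::Element, so the sketch runs without CPAN modules.
package Mock::Node;
sub new {
    my ( $class, $tag, @children ) = @_;
    return bless { tag => $tag, children => \@children }, $class;
}
sub tag {
    my $self = shift;
    $self->{tag} = shift if @_;
    return $self->{tag};
}
sub content_list { return @{ $_[0]{children} } }

# Hypothetical dialect module.
package My::Dialect;
sub preprocess_node {
    my ( $pkg, $wc, $node ) = @_;

    # Normalize <strong> to <b> so only one rule is needed later.
    $node->tag('b') if $node->tag eq 'strong';
    return;    # the return value is not used
}

package main;

# Pre-order traversal: visit the node first, then its children.
sub preprocess_tree {
    my ( $pkg, $wc, $node ) = @_;
    $pkg->preprocess_node( $wc, $node );
    preprocess_tree( $pkg, $wc, $_ ) for $node->content_list;
    return;
}

my $tree = Mock::Node->new( 'p', Mock::Node->new('strong'), Mock::Node->new('i') );
preprocess_tree( 'My::Dialect', undef, $tree );
print join( ' ', map { $_->tag } $tree->content_list ), "\n";    # b i
```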


Once the work of converting HTML is complete, it is sometimes useful to postprocess the resulting wiki markup. Postprocessing can be used to clean up whitespace, fix subtle problems in the markup that can't be handled during the original conversion, etc.

Dialects that want to postprocess the wiki markup should define a postprocess_output class method that will be called just before HTML::WikiConverter's output method returns to the client. The method will be passed three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) a reference to the wiki markup. It may modify the wiki markup that the reference points to. The return value of postprocess_output is ignored.

For example, to replace a series of line breaks with a pair of newlines, a dialect might implement this:

  sub postprocess_output {
    my( $pkg, $wc, $outref ) = @_;
    $$outref =~ s/<br>\s*<br>/\n\n/g;
  }

(This example assumes that HTML line breaks were replaced with <br> in the wiki markup.)
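
Because the markup is passed by reference, the method changes the caller's string in place. The standalone sketch below shows that calling convention with a hypothetical My::Dialect package; it uses a slightly broader pattern that collapses runs of two or more <br> and also strips trailing whitespace, both of which are choices made for this illustration.

```perl
use strict;
use warnings;

# Hypothetical dialect module.
package My::Dialect;
sub postprocess_output {
    my ( $pkg, $wc, $outref ) = @_;

    # Collapse any run of two or more <br> into a blank line.
    $$outref =~ s/(?:<br>\s*){2,}/\n\n/g;

    # Strip trailing spaces and tabs from every line.
    $$outref =~ s/[ \t]+$//gm;
    return;    # the return value is ignored
}

package main;

my $wiki = "one<br><br>two  \nthree";
My::Dialect->postprocess_output( undef, \$wiki );    # modifies $wiki in place
print $wiki, "\n";
```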


Please report bugs using http://rt.cpan.org.


HTML::TreeBuilder, HTML::Element


David J. Iberri <diberri@yahoo.com>


Copyright (c) 2004-2005 David J. Iberri

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html