Renard::Incunabula::MuPDF::mutool - Retrieve PDF image and text data via MuPDF's mutool


version 0.004



  _call_mutool( @args )

Helper function which calls mutool with the contents of the @args array.

Returns the captured STDOUT of the call.

This function dies if mutool exits unsuccessfully.
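The capture-and-die behaviour described above can be sketched with core Perl alone. This is only an illustration under assumptions, not the module's actual implementation: run_and_capture is a hypothetical name, and the sketch treats any non-zero wait status as failure.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of this module): run a command,
# return its STDOUT, and die if the command fails.
sub run_and_capture {
    my (@cmd) = @_;

    # List-form pipe open bypasses the shell, so arguments need no quoting.
    open( my $fh, '-|', @cmd )
        or die "Could not start $cmd[0]: $!\n";

    # Slurp everything the child writes to STDOUT.
    my $stdout = do { local $/; <$fh> };
    $stdout = '' unless defined $stdout;

    # close() on a pipe waits for the child and sets $?.
    close($fh);
    die "$cmd[0] failed (wait status $?)\n" if $? != 0;

    return $stdout;
}
```

A caller would pass the program name followed by its arguments, for example run_and_capture('mutool', @args).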


  get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no)

This function returns a PNG stream that renders page number $pdf_page_no of the PDF file $pdf_filename.
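As a rough illustration of working with such a stream, the sketch below checks the PNG signature and reads the pixel dimensions from the IHDR chunk. It assumes the returned value is the raw bytes of a complete PNG file; png_dimensions is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: return (width, height) of a PNG byte string,
# or an empty list if the data does not look like a PNG.
sub png_dimensions {
    my ($png) = @_;
    return unless defined $png && length($png) >= 24;

    # Every PNG file starts with this 8-byte signature.
    return unless substr( $png, 0, 8 ) eq "\x89PNG\r\n\x1a\n";

    # The IHDR chunk comes first: 4-byte length, 4-byte type, then
    # width and height as big-endian 32-bit integers.
    return unless substr( $png, 12, 4 ) eq 'IHDR';
    return unpack 'N N', substr( $png, 16, 8 );
}
```

When saving the stream to disk, open the file in raw mode (for example with the '>:raw' layer) so the bytes are written unchanged.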


  get_mutool_text_stext_raw($pdf_filename, $pdf_page_no)

This function returns an XML string that contains structured text from page number $pdf_page_no of the PDF file $pdf_filename.

The XML format is defined by the output of mutool and looks like this (for page 23 of the pdf_reference_1-7.pdf file):

  <?xml version="1.0"?>
  <document name="(null)">
    <page height="666" width="531">
      <block bbox="261.18 616.16397 269.77766 625.2532">
        <line bbox="261.18 616.16397 269.77766 625.2532" dir="1 0" wmode="0">
          <font name="MyriadPro-Semibold" size="7.98">
            <char bbox="261.18 616.16397 265.45729 625.2532" c="2" x="261.18" y="623.2582"/>
            <char bbox="265.50038 616.16397 269.77766 625.2532" c="3" x="265.50038" y="623.2582"/>
      <block bbox="225.78 88.20229 305.18159 117.93829">
        <line bbox="225.78 88.20229 305.18159 117.93829" dir="1 0" wmode="0">
          <font name="MyriadPro-Bold" size="24">
            <char bbox="225.78 88.20229 239.724 117.93829" c="P" x="225.78" y="111.93829"/>
            <char bbox="239.5176 88.20229 248.63759 117.93829" c="r" x="239.5176" y="111.93829"/>
            <char bbox="248.4552 88.20229 261.1272 117.93829" c="e" x="248.4552" y="111.93829"/>
            <char bbox="261.1128 88.20229 269.29679 117.93829" c="f" x="261.1128" y="111.93829"/>

Simplified, the high-level structure looks like:

  <page> -> [list of blocks]
    <block> -> [list of blocks]
      a block is either:
        - stext
            <line> -> [list of lines] (all have same baseline)
              <font> -> [list of fonts] (horizontal spaces over a line)
                <char> -> [list of chars]
        - image
            # TODO document the image data from mutool
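To make the nesting concrete, here is a minimal sketch that rebuilds the text of each line by joining the c attributes of its char elements. It is an assumption-laden shortcut: stext_lines is a hypothetical helper, pattern matching stands in for a real XML parser, and entities such as &amp; are not decoded.

```perl
use strict;
use warnings;

# Hypothetical helper: return one string per <line>, built from the
# c="..." attribute of every <char> inside that line.
sub stext_lines {
    my ($xml) = @_;
    my @lines;

    # Everything before the first <line ...> tag is discarded; each
    # remaining chunk starts just after a <line opening tag.
    my ( undef, @chunks ) = split /<line\b/, $xml;

    for my $chunk (@chunks) {
        my @chars = $chunk =~ /<char\b[^>]*?\bc="([^"]*)"/g;
        push @lines, join '', @chars;
    }
    return @lines;
}
```

Applied to the sample above, this would give the two strings "23" and "Pref". For anything beyond a quick look, a proper XML parser is the safer choice.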


  get_mutool_text_stext_xml($pdf_filename, $pdf_page_no)

Returns a HashRef of the structured text from page number $pdf_page_no of the PDF file $pdf_filename.

See the function get_mutool_text_stext_raw for details on the structure of this data.



  get_mutool_page_info_raw($pdf_filename)

Returns an XML string of the page bounding boxes of PDF file $pdf_filename.

The data is in the form:

    <page pagenum="1">
      <MediaBox l="0" b="0" r="531" t="666" />
      <CropBox l="0" b="0" r="531" t="666" />
      <Rotate v="0" />
    <page pagenum="2">
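
As an illustration of using this data, the sketch below derives each page's width and height from its MediaBox (r minus l, and t minus b). It assumes the MediaBox element directly follows the page tag, as in the sample; media_box_sizes is a hypothetical helper, and pattern matching stands in for a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: map page number => [ width, height ],
# computed from the MediaBox of each page.
sub media_box_sizes {
    my ($xml) = @_;
    my %size;

    while ( $xml =~ /<page\s+pagenum="(\d+)">\s*<MediaBox\s+([^>]*?)\/?>/g ) {
        my ( $page, $attrs ) = ( $1, $2 );

        # Turn l="0" b="0" r="531" t="666" into a hash.
        my %box = $attrs =~ /(\w+)="([^"]*)"/g;

        $size{$page} = [ $box{r} - $box{l}, $box{t} - $box{b} ];
    }
    return \%size;
}
```

For the first page in the sample this would give a width of 531 and a height of 666.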



  get_mutool_page_info_xml($pdf_filename)

Returns a HashRef containing the page bounding boxes of PDF file $pdf_filename.

See function get_mutool_page_info_raw for information on the structure of the data.


  get_mutool_outline_simple($pdf_filename)

Returns the outline of the PDF file $pdf_filename as an ArrayRef[HashRef] that corresponds to the items attribute of Renard::Incunabula::Outline.
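A sketch of how such a list might be turned into an indented table of contents follows. The keys level, text, and page are assumptions about the HashRef layout made for illustration (check the Renard::Incunabula::Outline documentation for the real keys), and format_outline is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: format outline items as indented lines.
# Assumes each item has the keys level, text and page.
sub format_outline {
    my ($items) = @_;
    my @lines;
    for my $item (@$items) {
        # Indent two spaces per nesting level.
        push @lines, sprintf '%s%s (p. %d)',
            '  ' x $item->{level}, $item->{text}, $item->{page};
    }
    return @lines;
}
```

Under those assumptions, print "$_\n" for format_outline( get_mutool_outline_simple($pdf_filename) ) would print one line per outline entry.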


  get_mutool_get_trailer_raw($pdf_filename)

Returns the trailer of the PDF file $pdf_filename as a string.


  get_mutool_get_object_raw($pdf_filename, $object_id)

Returns the object given by the ID $object_id for PDF file $pdf_filename as a string.
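The trailer and object functions can be combined: a PDF trailer dictionary usually points to the document information dictionary through an indirect reference such as /Info 2 0 R. The sketch below pulls the object number out of a trailer string. It assumes the string uses ordinary PDF dictionary syntax; info_object_id is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the object number that the trailer's
# /Info entry refers to, or undef if there is no such entry.
sub info_object_id {
    my ($trailer) = @_;

    # An indirect reference has the form "<object> <generation> R".
    if ( $trailer =~ m{/Info\s+(\d+)\s+\d+\s+R\b} ) {
        return $1;
    }
    return undef;
}
```

The resulting number could then be passed as $object_id to get_mutool_get_object_raw.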


  get_mutool_get_info_object_parsed( $pdf_filename )

Returns the document information dictionary as a Renard::Incunabula::MuPDF::mutool::ObjectParser object.

See Table 10.2 on pg. 844 of the PDF Reference, version 1.7, for the entries that are usually used (e.g., Title, Author).


