=head1	NAME

ODF::lpOD::Document - General ODF package handling and metadata


=head1  DESCRIPTION

This manual page describes the C<odf_document> class, the common features of any
C<odf_part> of a C<odf_document>, and the particular features of the
C<odf_meta> and C<odf_manifest> parts (which handle the global document metadata
and the manifest of the associated container).

Every C<odf_document> is associated with a C<odf_container> that encapsulates
all the physical access logic. On the other hand, every C<odf_document> is
made of several components called I<parts>. The lpOD API is mainly focused
on the parts that describe the global metadata, the text content, the layout and
the structure of the document, and that are physically stored according to an
XML schema. The common lpOD class for these parts is C<odf_xmlpart> (whose Perl
implementation is the C<ODF::lpOD::XMLPart> package).

lpOD provides specialized classes for the conventional ODF XML parts, namely
C<odf_meta>, C<odf_content>, C<odf_styles>, C<odf_settings>, C<odf_manifest>;
some of them provide methods dedicated to get or set the document metadata.

In order to process particular pieces of content in the most complex parts,
i.e. C<odf_content> and C<odf_styles>, the C<odf_element> class and its various
specialized derivatives are available. They are described in other chapters of
the lpOD documentation.

=head1  Document initialization

Any access to a document requires a valid C<odf_document> instance, that may be
created from an existing document or from scratch, using one of the constructors
introduced below. Once created, this instance gives access to individual parts
through the C<get_part()> method.

=head3  odf_get_document(uri)

This function creates a read-write document instance. The returned object is
associated with an existing physical ODF resource, which may be updated. The
required argument is the URI of the resource.

I<Note: in the present implementation, the URI argument must be either a
file path or a C<IO::Handle> corresponding to an open file or socket. The
physical resource must be a well-formed compressed ODF file, such as those
natively produced by OpenOffice.org or compatible office software suites.>


        # single quotes keep the backslashes literal (no "\t" tab escape)
        my $doc = odf_get_document('C:\MyDocuments\test.odt');

If the C<save> method of C<odf_document> is later used without an explicit
target, the document is written back to the same resource.

=head3  odf_new_document_from_template(uri)

Same as C<odf_get_document>, but the ODF resource is used in read-only mode,
i.e. it's used as a template in order to generate other ODF physical documents.

Some metadata of the new document are initialized to the following values:

=over 4

=item *

the creation and modification dates are set to the current date;

=item *

the creator and initial creator are set to the owner of the current process
as reported by the operating system (if this information is available);

=item *

the number of editing cycles is set to 1;

=item *

the identification string of the current lpOD distribution is used as the
generator identifier string.

=back

Each piece of metadata may be changed later by the application.
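
As a minimal sketch of the template-based workflow, the sequence below creates
a document from a template, changes one piece of the initialized metadata, and
saves the result elsewhere. The file names are hypothetical, C<use ODF::lpOD>
is assumed to import the constructors, and the C<get_meta>, C<store> and
C<save> methods are those described later in this page.

        use ODF::lpOD;

        # "letter_template.ott" and "letter.odt" are hypothetical file names
        my $doc = odf_new_document_from_template("letter_template.ott");

        # override one of the automatically initialized metadata
        my $meta = $doc->get_meta;
        $meta->set_creator("The lpOD Project team");
        $meta->store;

        # the template is used in read-only mode, so a target is given
        $doc->save(target => "letter.odt");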

=head3  odf_new_document(doc_type)

Unlike other constructors, this one generates a C<odf_document> instance from
scratch. Technically, it's a variant of C<odf_new_document_from_template>, but
the default template (provided with the lpOD library) is used. The required
argument specifies the document type, that must be C<'text'>, C<'spreadsheet'>,
C<'presentation'>, or C<'drawing'>. The new document instance is not persistent;
no file is created before an explicit use of the C<save> method.

The following example creates a spreadsheet document instance:

        my $doc = odf_new_document('spreadsheet');

The real content of the instance depends on the default template.

A set of valid template ODF files (created using OpenOffice.org) is
transparently installed with the standard lpOD distribution. Advanced users may
use their own template files. To do so, they have to replace the ODF files
present in the C<templates> subdirectory of the lpOD installation; the path to
the lpOD installation may be retrieved through the
C<< lpod->installation_path >> common function. The user-provided template
files must have the same names.

Some metadata are initialized in the same way as with
C<odf_new_document_from_template>.

=head1  Document MIME type check and control

=head3  get_mimetype

Returns the MIME type of the document (i.e. the full string that identifies
the document type). An example of a regular ODF MIME type is
C<application/vnd.oasis.opendocument.text>.


=head3  set_mimetype(new_mimetype)

Allows the user to force a new arbitrary MIME type (not to be used in ordinary
lpOD applications!).
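
The following sketch shows one possible use of C<get_mimetype>: checking that
a document is a text document before processing it. The file name is
hypothetical, and C<use ODF::lpOD> is assumed to import the constructor.

        use ODF::lpOD;

        my $doc = odf_get_document("report.odt");   # hypothetical file name
        my $mimetype = $doc->get_mimetype;

        # regular MIME type of an ODF text document
        if ($mimetype eq 'application/vnd.oasis.opendocument.text') {
                print "Text document\n";
        } else {
                print "Unexpected document type: $mimetype\n";
        }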

=head1  Access to individual document parts

=head3  get_part(name)

Generic C<odf_document> method allowing access to any I<part> of a previously
created document instance, including parts that are not handled by lpOD.
The lpOD library provides symbolic constants that represent the usual ODF
XML parts, such as C<CONTENT>, C<STYLES>, C<META> and C<MANIFEST>.

This instruction returns the I<CONTENT> part of a document as a C<odf_content>
instance:

        $content = $document->get_part(CONTENT);

With C<MIMETYPE> as argument, C<get_part()> returns the MIME type of the
document as a text string, i.e. the same result as C<get_mimetype()>.

This method may be used in order to get any other document part, such as an
image or any other non-XML part. To do so, the real path of the needed part
must be specified instead of one of the XML part symbolic names. As an example,
the instruction below returns the binary content of an image:

        $img = $document->get_part('Pictures/logo.jpg');

In such a case, the method returns the data as an uninterpreted sequence of
bytes.

(Remember that image files included in an ODF package are stored in a
C<Pictures> folder.)

Returns C<undef> in case of failure.

=head3  get_parts

Returns the list of the document parts.

=head1  Accessing data inside a part

Everything in a part is stored as a set of C<odf_element> instances. So, for
complex parts (such as C<CONTENT>) or parts that are not explicitly covered in
the present documentation, applications need to get access to an "entry
point" that is a particular element. The most used entry points are the C<root>
and the C<body>. Every part handler provides the C<get_root()> and C<get_body()>
methods, each one returning a C<odf_element> instance that provides all the
element-based features (including the creation, insertion or retrieval of other
elements, which may in turn become working contexts).

For those who know the ODF XML schema, two part-based methods allow the
selection of elements according to XPath expressions, namely C<get_element()>
and C<get_elements()>. The first one requires an I<XPath> expression and a
positional number; it returns the element corresponding to the given position
in the result set of the XPath expression (if any). The second one returns
the full result set (i.e. a list of C<odf_element> instances). For example,
the instructions below return respectively the first paragraph and all the
paragraphs of a part (assuming C<$part> is a previously selected document part):

        my $paragraph = $part->get_element('text:p', 0);
        my @paragraphs = $part->get_elements('text:p');

Note that the position argument of C<get_element> is zero-based, and that it
may be a negative value (if so, it specifies a position counted backward from
the last matching element, -1 being the position of the last one).
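
As an illustration of the backward positions, the sketch below selects the last
and the next-to-last paragraphs of the content part. The file name is
hypothetical, and C<use ODF::lpOD> is assumed to import the constructor and the
C<CONTENT> constant.

        use ODF::lpOD;

        my $doc = odf_get_document("report.odt");   # hypothetical file name
        my $part = $doc->get_part(CONTENT);

        # -1 is the last matching element, -2 the one before it
        my $last = $part->get_element('text:p', -1);
        my $previous = $part->get_element('text:p', -2);

        # no element is returned when the position doesn't exist
        print "Less than two paragraphs\n" unless $previous;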

So a large part of the lpOD functionality is described with the C<odf_element>
class, i.e. L<ODF::lpOD::Element>.

=head1  Global document metadata

From the handler provided by the C<get_meta> document method, several metadata
of the document may be directly read or set.

=head2  Simple metadata accessors

Most metadata are just text strings. The user may read or write each one using
a C<get_xxx> or C<set_xxx> accessor, where "xxx" is the lpOD name of a
particular property. The presently supported simple properties are:

=over 4

=item *

C<creation_date>: the date of the initial version of the document, expressed
in ISO-8601 date format;

=item *

C<creator>: the name of the user who created the current version of the
document;

=item *

C<description>: the long description of the document;

=item *

C<editing_cycles>: the number of edit sessions (may be regarded as a version
number);

=item *

C<editing_duration>: the total editing time through interactive software,
expressed as a time delta in ISO-8601 format;

=item *

C<generator>: the signature of the application that created the document;

=item *

C<initial_creator>: the name of the user who created the first version of the
document;

=item *

C<language>: the ISO code of the main language used in the document;

=item *

C<modification_date>: the date of the last modification (i.e. of the current
version);

=item *

C<subject>: the subject (or short description) of the document;

=item *

C<title>: the title of the document.

=back


When used without argument, some C<set> accessors may automatically set default
values, according to the capabilities of the runtime environment.
For C<set_creation_date()> and C<set_modification_date()>, the default
is the current system date. For C<set_creator()> and C<set_initial_creator()>,
the default is the identifier of the current system user. For
C<set_generator()> the default is the system name of the current program (as
it would appear in a command line) or, if not available, the current process
identifier. If the execution environment can't provide such information, no
default value is provided. C<set_editing_cycles()>, without argument,
increments the C<editing_cycles> indicator by 1.

Both C<set_creation_date> and C<set_modification_date> allow the user to provide
the date in the ODF-compliant (ISO-8601) format, or in numeric format (like the
Perl C<time> format). In the second case, the provided time is automatically
converted to the required format. The corresponding C<get_> accessors always
return the dates in their storage format. However, the lpOD library provides
a C<numeric_date> global function that translates a regular ISO date into a
Perl numeric C<time> value (a symmetric C<iso_date> global function translates
a Perl C<time> into an ISO date).
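
The sketch below illustrates the two date formats under the assumptions that
C<use ODF::lpOD> imports C<numeric_date> and C<iso_date>, and that each of them
takes the value to convert as its single argument. The file name is
hypothetical.

        use ODF::lpOD;

        my $doc = odf_get_document("report.odt");   # hypothetical file name
        my $meta = $doc->get_meta;

        # numeric value, automatically converted to ISO-8601
        $meta->set_modification_date(time);

        # the accessor returns the storage (ISO-8601) format
        my $iso = $meta->get_modification_date;

        # explicit conversions in both directions
        my $numeric = numeric_date($iso);
        print "Stored date:  $iso\n";
        print "Perl time:    $numeric\n";
        print "Back to ISO:  ", iso_date($numeric), "\n";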

Examples of use:

        $meta->set_title("The lpOD Cookbook");
        $meta->set_creator("The lpOD Project team");
        my $old_version = $meta->get_editing_cycles;
        $meta->set_editing_cycles($old_version + 1);

=head2  Document statistics

The global document statistics (as defined in §3.1.18 of the ODF 1.1
specification) may be read or set using the C<get_statistics> and
C<set_statistics> accessors. The first one returns the statistic properties as
a hash reference. The second one takes a hash reference with the same structure,
containing the attribute names and values. The following example displays the
page count of the document (assuming it's a text document):

        my $meta = $document->get_meta;
        my $stat = $meta->get_statistics;
        say $stat->{'meta:page-count'};

Note that nothing prevents the applications from using C<set_statistics> to
set any arbitrary figures.
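
As a sketch of the write side, the sequence below reads the statistics, changes
the page count and puts the whole structure back. The file name and the new
figure are arbitrary; C<meta:page-count> is the only attribute name taken from
the example above.

        use ODF::lpOD;

        my $document = odf_get_document("report.odt"); # hypothetical file
        my $meta = $document->get_meta;

        # get_statistics returns a hash reference
        my $stat = $meta->get_statistics;
        $stat->{'meta:page-count'} = 12;        # arbitrary figure

        # set_statistics takes a hash reference with the same structure
        $meta->set_statistics($stat);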

=head2  Keywords

The document metadata include a list of keywords (possibly empty). This list
may be used or changed.

=head3  get_keywords

Knowing that a document may be "tagged" by one or more keywords, C<odf_meta> provides a C<get_keywords> method that returns the list of the current keywords as a comma-separated string.

=head3  set_keywords(string_of_keywords)

C<set_keywords> allows the user to set a full list of keywords, provided as a single comma-separated string; the provided list replaces any previously existing keyword; this method, used without argument or with an empty string, just removes all the keywords. Example:

        $meta->set_keywords("ODF, OpenDocument, Python, Perl, Ruby, XML");

The spaces after the commas are ignored, and it's not possible to set a keyword that contains comma(s) through C<set_keywords>.

=head3  set_keyword(keyword)

C<set_keyword> appends a new, given keyword to the list; it's neutral if the given keyword is already present; it allows commas in the given keyword (but we don't recommend such a practice).

=head3  check_keyword(keyword)

C<check_keyword> returns C<TRUE> if its argument (which may be a regular expression)
matches an existing keyword, or C<FALSE> if the keyword is not present.

=head3  remove_keyword(expression)

C<remove_keyword> deletes any keyword that matches the argument (which may be a regular expression).
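
The keyword methods may be combined as in the following sketch, where the file
name and the keywords are arbitrary and C<use ODF::lpOD> is assumed to import
the constructor.

        use ODF::lpOD;

        my $doc = odf_get_document("report.odt");   # hypothetical file name
        my $meta = $doc->get_meta;

        # replace the whole list, then append a single keyword
        $meta->set_keywords("ODF, OpenDocument, XML");
        $meta->set_keyword("Perl");

        # check, then remove, a given keyword
        print "Tagged as Perl\n" if $meta->check_keyword("Perl");
        $meta->remove_keyword("XML");

        # the remaining keywords, as a comma-separated string
        print $meta->get_keywords, "\n";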

=head2  User-defined metadata

Each user-defined metadata element has a unique name (or key), a value and a datatype.

=head3  get_user_field(name)

Retrieves a user-defined field according to its name (that should be unique for
the document). In scalar context, returns the value of the field. In array
context, returns the value and the data type.
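
The difference between the two calling contexts may be sketched as follows,
assuming a C<$meta> handler previously obtained through C<get_meta> and a
hypothetical user field named "Development status":

        # scalar context: the value only
        my $value = $meta->get_user_field("Development status");

        # array context: the value and the data type
        my ($same_value, $type) = $meta->get_user_field("Development status");

        print "Value: $value, type: $type\n" if defined $value;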

The regular ODF datatypes are C<float>, C<date>, C<time>, C<boolean>, and
C<string>.

=head3  get_user_fields

The C<odf_meta> API provides a C<get_user_fields> method that returns a list
in which each element is a hash ref whose (self-documented) keys are C<name>,
C<value>, and C<type>.

As an example, the following loop displays the name, the value and the type of
each user field in the metadata part of a document:

        my $doc = odf_get_document($source);
        my $meta = $doc->get_meta;
        foreach my $uf ($meta->get_user_fields) {
                say "Name   " . $uf->{name} .
                    "Value  " . $uf->{value} .
                    "Type   " . $uf->{type};
        }

=head3  set_user_fields()

Allows the applications to set or change all the user-defined items.
Its argument is a list of hash refs with the same structure as the result of
C<get_user_fields()>.

=head3  set_user_field(name, value, type)

Creates or changes a user field. The first argument is the name (identifier),
the second one is the value.
The last argument is the data type, which must be ODF-compliant (see
C<get_user_field>). If the type is not specified, its default value is
C<'string'>. If the type is C<date>, the value is automatically converted to
ISO-8601 format if provided as a numeric C<time> value.


        $meta->set_user_field("Development status", "Working draft");
        $meta->set_user_field("Security status", "Classified");
        $meta->set_user_field("Ready for release", FALSE, "boolean");
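
The other data types may be used the same way. In the sketch below, the field
names are hypothetical and C<$meta> is assumed to be a handler previously
obtained through C<get_meta>; the C<date> value is given as a numeric C<time>
and left to the automatic ISO-8601 conversion.

        # numeric time value, converted to ISO-8601 by the library
        $meta->set_user_field("Review date", time, "date");

        # explicit float type instead of the default string type
        $meta->set_user_field("Page budget", 12, "float");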

=head1  How to persistently update a document

Every part may be updated using specific methods that create, change or remove
elements, but these methods don't produce any persistent effect by themselves.

The updates done in a given part may be either exported as an XML string, or
returned to the C<odf_document> instance to which the part belongs. With the
first option, the user is responsible for the management of the exported XML
(that can't be used as is through a typical office application), and the
original document is not persistently changed. The second option instructs the
C<odf_document> that the part has been changed and that this change should be
reflected as soon as the physical resource is written back. However, a
part-based method can't directly update the resource. The changes may be made
persistent through the C<save()> method of the C<odf_document> object.

=head3  serialize

This part-based method returns a full XML export of the part. The returned XML
string may be stored somewhere and used later in order to create or replace a
part in another document, or to feed another application.

A C<pretty> named option may be provided. If set to C<TRUE>, this option
specifies that the XML export should be as human-readable as possible.

The example below returns a conveniently indented XML representation of the
content part of a document:

        # single quotes keep the backslashes literal (no "\t" tab escape)
        $doc = odf_get_document('C:\MyDocuments\test.odt');
        $part = $doc->get_part(CONTENT);
        $xml = $part->serialize(pretty => TRUE);

Note that this XML export is not affected by the content encoding/decoding
mechanism that works for user content, so its character set doesn't depend on
the custom text output character set possibly selected through the
C<set_output_charset()> method introduced in L<ODF::lpOD::Common>.

=head3  store

This part-based method stores the present state (possibly changed) of the part
in a temporary, non-persistent space, waiting for the execution of the next
call of the document-based C<save> method.

The following example selects the C<CONTENT> part of a document, removes the
last paragraph of this content, then sends back the changed content to the
document, that in turn is made persistent:

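        $content = $document->get_part(CONTENT);
        $p = $content->get_body->get_paragraph(-1);
        $p->delete;             # remove the last paragraph
        $content->store;        # send the changed part back to the document
        $document->save;        # make the change persistent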

Like C<serialize()>, C<store()> allows the C<pretty> option.
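
A minimal sketch of the C<pretty> option with C<store()> follows; the file name
is hypothetical, and C<use ODF::lpOD> is assumed to import the constructor and
the C<CONTENT> and C<TRUE> constants.

        use ODF::lpOD;

        my $document = odf_get_document("report.odt"); # hypothetical file
        my $content = $document->get_part(CONTENT);

        # ... content updates would take place here ...

        # store the part as human-readable XML, then write the package
        $content->store(pretty => TRUE);
        $document->save;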

=head3  add_file

This document-based method stores an external file "as is" in the document
container, without interpretation. The mandatory argument is the path of the
source file.

Optional named parameters C<path> and C<type> are allowed; C<path> specifies
the destination path in the ODF package, while C<type> is the MIME type of the
added resource.

As an example, the instruction below inserts a binary image file available
in the current directory in the "Thumbnails" folder of the document package:

        $document->add_file("logo.png", path => "Thumbnails/thumbnail.png");

If the C<path> parameter is omitted, the destination folder in the package is
either C<Pictures> if the source is identified as an image file (caution: such
a recognition may not work with any image type in any environment) or the root
folder otherwise.

The following example creates an entry whose every property is specified:

        $document->add_file($source,    # $source: path of the external file
                path    => "Pictures/portrait.jpg",
                type    => "image/jpeg");

The return value is the destination path.

This method may be used in order to import an external XML file as a replacement
of a conventional ODF XML part without interpretation. As an example, the
following instruction replaces the C<STYLES> part of a document by an arbitrary
external file:
        $document->add_file("custom_styles.xml", path => STYLES);

Note that the physical effect of C<add_file()> is not immediate; the file is
really added (and the source is really required) only when the C<save()>
method, introduced below, is called. As a consequence, any update that could be
done in a document part loaded using C<add_file()> is lost. According to the
same logic, a document part loaded using C<add_file()> is never available in
the current document instance; it becomes available if the current instance
is made persistent through a C<save()> call then a new instance is created
using the saved package with C<odf_get_document>.

=head3  set_part

Allows the user to create or replace a document part using data in memory.
The first argument is the target ODF part, while the second one is the source
data.

=head3  del_part

Deletes a part in the document package. The deletion is physically done through
the subsequent call of C<save()>. The argument may be either the symbolic
constant standing for a conventional ODF XML part or the real path of
the part in the package.

The following sequence replaces (without interpretation) the current document
content part by an external content:

        $document->del_part(CONTENT);
        $document->add_file("/somewhere/stuff.xml", path => CONTENT);

Note that the order of these instructions is not significant; when C<save()>
is called, it executes all the deletions then all the part insertions and/or
updates.

=head3  save

This method is provided by the C<odf_document>. If the document instance is
associated with a regular ODF resource available for update (meaning that it
has been created using C<odf_get_document> and that the user has a write
access to the resource), the resource is written back and reflects all the
changes previously committed by one or more document parts using their
respective C<store> methods.

As an example, the sequence below updates an ODF file according to changes made
in the C<META> and C<CONTENT> parts:

        my $doc = odf_get_document("/home/users/jmg/report.odt");
        my $meta = $doc->get_part(META);
        my $content = $doc->get_part(CONTENT);
        $meta->store;           # after the META updates
        $content->store;        # after the CONTENT updates
        $doc->save;

An optional C<target> parameter may be provided to C<save()>. If set, this
parameter specifies an alternative destination for the file (it produces the
same effect as the "File/Save As" feature of a typical office software).
The C<target> option is always allowed, but it's mandatory with C<odf_document>
instances created using a C<odf_new_document_from...> constructor.
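
The two situations may be sketched as follows, with hypothetical file names and
assuming that C<use ODF::lpOD> imports the constructors:

        use ODF::lpOD;

        # document created from scratch: no associated resource,
        # so the target is mandatory
        my $new = odf_new_document('text');
        $new->save(target => "draft.odt");

        # existing document: the target is optional and acts
        # like a "Save As"
        my $existing = odf_get_document("report.odt");
        $existing->save(target => "report_copy.odt");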

=head1  Manifest

The manifest part of a document holds the list of the files included in the
container associated with the C<odf_document>. It's represented by a
C<odf_manifest> object, that is a particular C<odf_xmlpart>.

Each included file is represented by a C<odf_file_entry> object, whose
properties are

=over 4

=item *

C<path>: the full path of the file in the container;

=item *

C<type>: the media type (or MIME type) of the file.

=back


=head2 Initialization

A C<odf_manifest> instance is created through the C<get_part()> method of
C<odf_document>, with C<MANIFEST> as part selector:

        $manifest = $document->get_part(MANIFEST);

=head2  Entry access

The full list of manifest entries may be obtained using C<get_entries()>.

It's possible to restrict the list with an optional C<type> parameter whose
value is a string or a regular expression. If C<type> is set, then the method
returns the entries whose media type string matches the given expression.

As an example, the first instruction below returns the entries that correspond
to XML parts only, while the next one returns all the XML entries, including
those whose type is not "text/xml" (such as "application/rdf+xml"), and the
last returns all the "image/xxx" entries (whatever the image format):

        @xmlp_entries = $manifest->get_entries(type => 'text/xml');
        @xml_entries = $manifest->get_entries(type => 'xml');
        @image_entries = $manifest->get_entries(type => 'image');

An individual entry may be selected according to its C<path>, knowing that the
path is the entry identifier. The C<get_entry()> method, whose mandatory
argument is the C<path>, does the job. The following instruction returns the
entry that stands for a given image resource included in the package (if any):

        $img_entry = $manifest->get_entry('Pictures/13BE2000BDD8EFA.jpg');
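
As a sketch combining entry selection with the C<get_path()> and C<get_type()>
accessors described under "Entry property handling", the loop below lists the
image entries of a package. The file name is hypothetical, and
C<use ODF::lpOD> is assumed to import the constructor and the C<MANIFEST>
constant.

        use ODF::lpOD;

        my $doc = odf_get_document("report.odt");   # hypothetical file name
        my $manifest = $doc->get_part(MANIFEST);

        # one line per image entry: path, then media type
        foreach my $entry ($manifest->get_entries(type => 'image')) {
                print $entry->get_path, "\t", $entry->get_type, "\n";
        }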

=head2  Entry creation and removal

Once selected, an entry may be deleted using the generic C<delete> method.
The C<del_entry()> method, whose mandatory argument is an entry path, deletes
the corresponding entry, if any. If the given entry doesn't exist, nothing is
done. The return value is the removed entry, or C<undef>.

A new entry may be added using the C<set_entry()> method. This method requires
a unique path as its mandatory argument. A C<type> optional named parameter
may be provided, but is not required; without C<type> specification, the media
type remains empty. This method returns the new entry object, or a null value
in case of failure. The example below adds an entry corresponding to an image
file:
        $manifest->set_entry('Pictures/xyz.jpg', type => 'image/jpeg');

If C<set_entry()> is called with the same path as an existing entry, the old
entry is removed and replaced by the new one.

If the entry path is a folder, i.e. if its last character is "/", then the
media type is automatically set to an empty value. However, this rule doesn't
apply to the root folder, i.e. "/", whose type should be the MIME type of the
document.
Beware: adding or removing a manifest entry doesn't automatically add or remove
the corresponding file in the container, and there is no automatic consistency
check between the real content of the part and the manifest.

=head2  Entry property handling

An individual manifest entry is a C<odf_file_entry> object, that is a
particular C<odf_element> object.

It provides the C<get_path()>, C<set_path()>, C<get_type()>, C<set_type()>
accessors, to get or set the C<path> and C<type> properties. There is no check
with C<set_type()>, so the user is responsible for the consistency between the
given type and the real content of the corresponding file. On the other hand,
C<set_path()> fails if the given C<path> is already used by another entry;
but there is no other check regarding this property, so the user must check the
consistency between the given path and the real path of the corresponding
file.

If C<set_path()> puts a path whose last character is "/", the media type of
the entry is automatically set to an empty string. However, for users who know
exactly what they do, C<set_type()> makes it possible to force a non-empty type
I<after> C<set_path()>.

Copyright (c) 2010 Ars Aperta, Itaapy, Pierlis, Talend.

This work was sponsored by the Agence Nationale de la Recherche

lpOD is free software; you can redistribute it and/or modify it under
the terms of either:

a) the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version.
lpOD is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with lpOD.  If not, see L<http://www.gnu.org/licenses/>.

b) the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at