NAME

PDF::Data - Manipulate PDF files and objects as data structures

VERSION

version v1.2.0

SYNOPSIS

use PDF::Data;

DESCRIPTION

This module can read and write PDF files, and represents PDF objects as data structures that can be readily manipulated.

METHODS

new

my $pdf = PDF::Data->new(-compress => 1, -minify => 1);

Constructor to create an empty PDF::Data object instance. Any arguments passed to the constructor are treated as key/value pairs, and included in the $pdf hash object returned from the constructor. When the PDF file data is generated, this hash is written to the PDF file as the trailer dictionary. However, hash keys starting with "-" are ignored when writing the PDF file, as they are considered to be flags or metadata.

For example, $pdf->{-compress} is a flag which controls whether or not streams will be compressed when generating PDF file data. This flag can be set in the constructor (as shown above), or set directly on the object.

The $pdf->{-minify} flag controls whether or not to save space in the generated PDF file data by removing comments and extra whitespace from content streams. This flag can be used along with $pdf->{-compress} to make the generated PDF file data even smaller, but this transformation is not reversible.
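The key convention described above can be illustrated with a plain hash. This is a minimal sketch that does not call PDF::Data; the helper name `trailer_keys` and the sample keys are hypothetical, and the module's own filtering may be implemented differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::Data): return the keys that would be
# treated as trailer dictionary entries, i.e. every key not starting with "-".
sub trailer_keys {
    my ($pdf) = @_;
    my @keys = grep { !/^-/ } keys %$pdf;
    return sort @keys;
}

# A plain hash standing in for a PDF::Data object.
my %pdf = (
    -compress => 1,
    -minify   => 1,
    Root      => { Type => '/Catalog' },
    Info      => {},
);

print join(', ', trailer_keys(\%pdf)), "\n";   # prints "Info, Root"
```

Keys such as `-compress` and `-minify` stay on the object as flags, while the remaining keys are the ones a writer would emit.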

clone

my $pdf_clone = $pdf->clone;

Deep copy the entire PDF::Data object itself.

new_page

my $page = $pdf->new_page;
my $page = $pdf->new_page('LETTER');
my $page = $pdf->new_page(8.5, 11);

Create a new page object with the specified size (in inches). Alternatively, certain page sizes may be specified using one of the known keywords: "LETTER" for U.S. Letter size (8.5" x 11"), "LEGAL" for U.S. Legal size (8.5" x 14"), or "A0" through "A8" for ISO A-series paper sizes. The default page size is U.S. Letter size (8.5" x 11").
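As a rough illustration of what a size in inches means in PDF terms: default PDF user space uses units of 1/72 inch, so a page rectangle can be derived as sketched below. The helper name `media_box_from_inches` is hypothetical, and how the module stores the page size internally is an assumption, not something the text above states.

```perl
use strict;
use warnings;

# PDF default user space uses units of 1/72 inch.
use constant POINTS_PER_INCH => 72;

# Hypothetical helper (not part of PDF::Data): build a rectangle
# [left, bottom, right, top] in points from a page size in inches.
sub media_box_from_inches {
    my ($width, $height) = @_;
    return [0, 0, $width * POINTS_PER_INCH, $height * POINTS_PER_INCH];
}

my $letter = media_box_from_inches(8.5, 11);
my $legal  = media_box_from_inches(8.5, 14);
print "LETTER: @$letter\n";   # prints "LETTER: 0 0 612 792"
print "LEGAL:  @$legal\n";    # prints "LEGAL:  0 0 612 1008"
```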

copy_page

my $copied_page = $pdf->copy_page($page);

Deep copy a single page object.

append_page

$page = $pdf->append_page($page);

Append the specified page object to the end of the PDF page tree.

read_pdf

my $pdf = PDF::Data->read_pdf($file, %args);

Read a PDF file and parse it with $pdf->parse_pdf(), returning a new object instance. Any streams compressed with the /FlateDecode filter will be automatically decompressed. Unless the $pdf->{-decompress} flag is set, the same streams will also be automatically recompressed when generating PDF file data.

parse_pdf

my $pdf = PDF::Data->parse_pdf($data, %args);

Used by $pdf->read_pdf() to parse the raw PDF file data and create a new object instance. This method can also be called directly instead of calling $pdf->read_pdf() if the PDF file data comes from another source instead of a regular file.

write_pdf

$pdf->write_pdf($file, $time);

Generate and write a new PDF file from the current state of the PDF::Data object.

The optional $time parameter specifies the modification timestamp to save in the PDF metadata and to set as the file modification timestamp of the output file. If $time is not defined, it defaults to the current time. If $time is defined but false (zero or empty string), this method skips setting the modification time in the PDF metadata and skips setting the timestamp on the output file.
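The three cases for $time can be summarized in a small sketch. The helper name `effective_time` is hypothetical and only mirrors the rules described here; it is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented rules for $time:
#   undefined -> use the current time
#   false     -> return undef, meaning "skip the timestamp"
#   otherwise -> use the value as given
sub effective_time {
    my ($time) = @_;
    return time() unless defined $time;
    return undef  unless $time;
    return $time;
}

for my $arg (undef, 0, '', 1700000000) {
    my $result = effective_time($arg);
    printf "%-12s => %s\n",
        defined $arg    ? "'$arg'" : 'undef',
        defined $result ? $result  : '(no timestamp)';
}
```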

pdf_file_data

my $pdf_file_data = $pdf->pdf_file_data($time);

Generate PDF file data from the current state of the PDF data structure, suitable for writing to an output PDF file. This method is used by the $pdf->write_pdf() method to generate the raw string of bytes to be written to the output PDF file. This data can be directly used (e.g. as a MIME attachment) without the need to actually write a PDF file to disk.

The optional $time parameter may be used to specify the modification timestamp to save in the PDF metadata. If not specified, it defaults to the current time. If a false value is specified, this method will skip setting the modification time in the PDF metadata.

dump_pdf

$pdf->dump_pdf($file, $mode);

Dump the PDF internal structure and data for debugging. If the $mode parameter is "outline", dump only the PDF internal structure without the data.

dump_outline

$pdf->dump_outline($file);

Dump an outline of the PDF internal structure for debugging. (This method simply calls the $pdf->dump_pdf() method with the $mode parameter specified as "outline".)

merge_content_streams

my $stream = $pdf->merge_content_streams($array_of_streams);

Merge multiple content streams into a single content stream.

find_bbox

$pdf->find_bbox($content_stream, $new);

Analyze a content stream to determine the correct bounding box for the content stream. The current implementation was purpose-built for a specific use case and should not be expected to work correctly for most content streams.

The $content_stream parameter may be a stream object or a string containing the raw content stream data.

The current algorithm breaks the content stream into lines, skips over various "neutral" lines and examines the coordinates specified for certain PDF drawing operators: "m" (moveto), "l" (lineto), "v" (curveto, initial point replicated), "y" (curveto, final point replicated), and "c" (curveto, all points specified).

The minimum and maximum X and Y coordinates seen for these drawing operators are used to determine the bounding box (left, bottom, right, top) for the content stream. The bounding box and equivalent rectangle (left, bottom, width, height) are printed.

If the $new boolean parameter is set, an updated content stream is generated with the coordinates adjusted to move the lower left corner of the bounding box to (0, 0). This would be better done by translating the transformation matrix.
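The line-by-line scanning approach described above can be sketched as follows. This is an illustration under assumptions, not the module's code: the function name `simple_bbox` is hypothetical, every operand pair (including curve control points) is treated as a coordinate, and any line that is not exactly numbers followed by one of the five operators is skipped.

```perl
use strict;
use warnings;

# Number of operands expected by each path operator considered here.
my %operand_count = (m => 2, l => 2, v => 4, y => 4, c => 6);

# Hypothetical sketch: return (left, bottom, right, top) for all coordinates
# passed to m, l, v, y and c, or four undefs if none are found.
sub simple_bbox {
    my ($content) = @_;
    my ($left, $bottom, $right, $top);
    for my $line (split /\n/, $content) {
        my @tokens = split ' ', $line;
        next unless @tokens;
        my $op = pop @tokens;
        next unless exists $operand_count{$op}
            && @tokens == $operand_count{$op};
        # Skip the line if any operand is not a plain number.
        next if grep { !/^[-+]?(?:\d+\.?\d*|\.\d+)$/ } @tokens;
        while (my ($x, $y) = splice(@tokens, 0, 2)) {
            $left   = $x if !defined $left   || $x < $left;
            $right  = $x if !defined $right  || $x > $right;
            $bottom = $y if !defined $bottom || $y < $bottom;
            $top    = $y if !defined $top    || $y > $top;
        }
    }
    return ($left, $bottom, $right, $top);
}

my $content = <<'END';
q
10 20 m
110 20 l
110 70 60 120 10 70 c
h
S
Q
END

my @bbox = simple_bbox($content);
print "bbox: @bbox\n";   # prints "bbox: 10 20 110 120"
printf "rect: %s %s %s %s\n",
    $bbox[0], $bbox[1], $bbox[2] - $bbox[0], $bbox[3] - $bbox[1];
```

Because control points are included, the box for a curve can be larger than the curve itself; a tighter box would require evaluating the curve's extrema.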

new_bbox

$new_content = $pdf->new_bbox($content_stream);

This method simply calls the $pdf->find_bbox() method above with $new set to 1.

timestamp

my $timestamp = $pdf->timestamp($time);
my $now       = $pdf->timestamp;

Generate a timestamp in PDF internal format. If $time is omitted, the current time is used.
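For readers unfamiliar with PDF date strings, the sketch below formats an epoch time as a date of the form D:YYYYMMDDHHmmSS with a "Z" suffix for UTC. The function name `pdf_date_utc` is hypothetical, and the choice of UTC is an assumption; the module may emit local time with a zone offset instead.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical sketch: format an epoch time as a PDF date string in UTC.
# Defaults to the current time when no argument is given.
sub pdf_date_utc {
    my ($time) = @_;
    $time = time() unless defined $time;
    return strftime('D:%Y%m%d%H%M%SZ', gmtime($time));
}

print pdf_date_utc(0), "\n";   # prints "D:19700101000000Z"
print pdf_date_utc(), "\n";    # current time, e.g. D:YYYYMMDDHHmmSSZ
```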

UTILITY METHODS

round

my @numbers = $pdf->round(@numbers);

Round numeric values to 12 significant digits to avoid floating-point rounding error and remove trailing zeroes.
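One common way to get this effect in Perl is the `%.12g` format, which keeps 12 significant digits and drops trailing zeroes. The sketch below uses that technique under the hypothetical name `round12`; whether the module uses the same approach is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: round each value to 12 significant digits.
# Adding 0 turns the formatted string back into a number.
sub round12 {
    return map { sprintf('%.12g', $_) + 0 } @_;
}

my @values = round12(0.1 + 0.2, 1.50, 1 / 3);
print "@values\n";   # prints "0.3 1.5 0.333333333333"
```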

concat_matrix

my $matrix = $pdf->concat_matrix($transformation_matrix, $original_matrix);

Concatenate a transformation matrix with an original matrix, returning a new matrix. This is for arrays of 6 elements representing standard 3x3 transformation matrices as used by PostScript and PDF.
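The arithmetic behind concatenating two 6-element matrices can be sketched as below. The function name `concat_6` is hypothetical, and the mapping of its arguments to the two parameters of $pdf->concat_matrix() is an assumption; the sketch simply applies its first argument and then its second.

```perl
use strict;
use warnings;

# Hypothetical sketch: multiply two PDF-style matrices given as 6-element
# arrays [a b c d e f], each standing for the 3x3 matrix
#   a b 0
#   c d 0
#   e f 1
# The result applies $first, then $second, to a row vector [x y 1].
sub concat_6 {
    my ($first, $second) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$first;
    my ($a2, $b2, $c2, $d2, $e2, $f2) = @$second;
    return [
        $a1 * $a2 + $b1 * $c2,
        $a1 * $b2 + $b1 * $d2,
        $c1 * $a2 + $d1 * $c2,
        $c1 * $b2 + $d1 * $d2,
        $e1 * $a2 + $f1 * $c2 + $e2,
        $e1 * $b2 + $f1 * $d2 + $f2,
    ];
}

# Translate by (10, 20), then scale by (2, 3).
my $combined = concat_6([1, 0, 0, 1, 10, 20], [2, 0, 0, 3, 0, 0]);
print "@$combined\n";   # prints "2 0 0 3 20 60"
```

Order matters: scaling first and translating second gives a different matrix, because the translation is then not scaled.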

invert_matrix

my $inverse = $pdf->invert_matrix($matrix);

Calculate the inverse of a matrix, if possible. Returns undef if the matrix is not invertible.
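For a 6-element matrix [a b c d e f], invertibility depends on the determinant a*d - b*c of the linear part. The sketch below shows the standard closed-form inverse under the hypothetical name `invert_6`; it is an illustration of the math, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical sketch: invert a PDF-style matrix [a b c d e f].
# Returns undef (in scalar context) when the determinant is zero.
sub invert_6 {
    my ($matrix) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$matrix;
    my $det = $a1 * $d1 - $b1 * $c1;
    return if $det == 0;
    return [
         $d1 / $det,
        -$b1 / $det,
        -$c1 / $det,
         $a1 / $det,
        ($c1 * $f1 - $d1 * $e1) / $det,
        ($b1 * $e1 - $a1 * $f1) / $det,
    ];
}

# Scale by (2, 4) and translate by (10, 20); the inverse undoes both.
my $inverse = invert_6([2, 0, 0, 4, 10, 20]);
print defined $inverse ? "invertible\n" : "not invertible\n";

# A matrix whose rows are linearly dependent has no inverse.
my $singular = invert_6([1, 2, 2, 4, 0, 0]);
print defined $singular ? "invertible\n" : "not invertible\n";
```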

translate

my $matrix = $pdf->translate($x, $y);

Returns a 6-element transformation matrix representing translation of the origin to the specified coordinates.

scale

my $matrix = $pdf->scale($x, $y);

Returns a 6-element transformation matrix representing scaling of the coordinate space by the specified horizontal and vertical scaling factors.

rotate

my $matrix = $pdf->rotate($angle);

Returns a 6-element transformation matrix representing counterclockwise rotation of the coordinate system by the specified angle (in degrees).
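The three matrix constructors above correspond to the standard PDF forms shown in this sketch. The function names are hypothetical stand-ins written for illustration, and `transform_point` is an added helper for checking results; none of this is taken from the module's source.

```perl
use strict;
use warnings;
use constant PI => 4 * atan2(1, 1);

# Hypothetical stand-ins for the standard PDF matrix forms.
sub translation_matrix { my ($x, $y) = @_; return [1, 0, 0, 1, $x, $y]; }
sub scaling_matrix     { my ($x, $y) = @_; return [$x, 0, 0, $y, 0, 0]; }

sub rotation_matrix {
    my ($degrees) = @_;
    my $radians = $degrees * (PI / 180);
    my ($cos, $sin) = (cos($radians), sin($radians));
    return [$cos, $sin, -$sin, $cos, 0, 0];   # counterclockwise
}

# Apply [a b c d e f] to a point: x' = a*x + c*y + e, y' = b*x + d*y + f.
sub transform_point {
    my ($matrix, $x, $y) = @_;
    my ($a1, $b1, $c1, $d1, $e1, $f1) = @$matrix;
    return ($a1 * $x + $c1 * $y + $e1, $b1 * $x + $d1 * $y + $f1);
}

my @moved = transform_point(translation_matrix(10, 20), 1, 2);
print "@moved\n";   # prints "11 22"

my @scaled = transform_point(scaling_matrix(2, 3), 1, 2);
print "@scaled\n";  # prints "2 6"

# Rotating (1, 0) by 90 degrees lands on (0, 1), up to rounding error.
my @turned = transform_point(rotation_matrix(90), 1, 0);
```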

INTERNAL METHODS

validate

$pdf->validate;

Used by $pdf->new(), $pdf->parse_pdf() and $pdf->write_pdf() to validate some parts of the PDF structure. Currently, $pdf->validate() uses $pdf->validate_key() to verify that the document catalog and page tree root node exist and have the correct type, and that the page tree root node has no parent node. Then it calls $pdf->validate_page_tree() to validate the entire page tree.

By default, validation errors are output as warnings, but the $pdf->{-validate} flag can be set to make the errors fatal.

validate_page_tree

my $count = $pdf->validate_page_tree($path, $page_tree_node);

Used by $pdf->validate(), and called by itself recursively, to validate the PDF page tree and its subtrees. The $path parameter specifies the logical path from the root of the PDF::Data object to the page subtree, and the $page_tree_node parameter specifies the actual page tree node data structure represented by that logical path. $pdf->validate() initially calls $pdf->validate_page_tree() with "Root/Pages" for $path and $pdf->{Root}{Pages} for $page_tree_node.

Each child of the page tree node (in $page_tree_node->{Kids}) should be another page tree node for a subtree or a single page node. In either case, the parameters used for the next method call will be "$path\[$i]" for $path (e.g. "Root/Pages[0][1]") and $page_tree_node->{Kids}[$i] for $page_tree_node (e.g. $pdf->{Root}{Pages}{Kids}[0]{Kids}[1]). These parameters are passed to either $pdf->validate_page_tree() recursively (if the child is a page tree node) or to $pdf->validate_page() (if the child is a page node).

After validating the page tree, $pdf->validate_resources() will be called to validate the page tree's resources, if any.

If the count of pages in the page tree is incorrect, it will be fixed. This method returns the total number of pages in the specified page tree.
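The recursive walk and count repair described above can be sketched on plain nested hashes. The function name `count_pages` is hypothetical, the sketch skips all other validation, and representing the node type as the string "/Pages" is an assumption based on how "/Form" is written elsewhere in this document.

```perl
use strict;
use warnings;

# Hypothetical sketch: count the pages under a page tree node, fixing each
# node's Count along the way. Returns the number of pages in the subtree.
sub count_pages {
    my ($path, $node) = @_;
    my $count = 0;
    my $i     = 0;
    for my $kid (@{ $node->{Kids} || [] }) {
        my $kid_path = "$path\[$i]";   # e.g. "Root/Pages[0]"
        $i++;
        if (($kid->{Type} // '') eq '/Pages') {
            $count += count_pages($kid_path, $kid);   # subtree
        }
        else {
            $count++;                                 # single page
        }
    }
    $node->{Count} = $count
        if !defined $node->{Count} || $node->{Count} != $count;
    return $count;
}

my $pages = {
    Type  => '/Pages',
    Count => 0,    # deliberately wrong
    Kids  => [
        { Type => '/Pages', Count => 5, Kids => [ { Type => '/Page' }, { Type => '/Page' } ] },
        { Type => '/Page' },
    ],
};

print count_pages('Root/Pages', $pages), "\n";   # prints "3"
print $pages->{Kids}[0]{Count}, "\n";            # prints "2"
```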

validate_page

$pdf->validate_page($path, $page);

Used by $pdf->validate_page_tree() to validate a single page of the PDF. The $path parameter specifies the logical path from the root of the PDF::Data object to the page, and the $page parameter specifies the actual page data structure represented by that logical path.

This method will call $pdf->merge_content_streams() to merge the content streams into a single content stream (if $page->{Contents} is an array), then it will call $pdf->validate_content_stream() to validate the page's content stream.

After validating the page, $pdf->validate_resources() will be called to validate the page's resources, if any.

validate_resources

$pdf->validate_resources($path, $resources);

Used by $pdf->validate_page_tree(), $pdf->validate_page() and $pdf->validate_xobject() to validate associated resources. The $path parameter specifies the logical path from the root of the PDF::Data object to the resources, and the $resources parameter specifies the actual resources data structure represented by that logical path.

This method will call $pdf->validate_xobjects() for $resources->{XObject}, if set.

validate_xobjects

$pdf->validate_xobjects($path, $xobjects);

Used by $pdf->validate_resources() to validate form XObjects in the resources. The $path parameter specifies the logical path from the root of the PDF::Data object to the hash of form XObjects, and the $xobjects parameter specifies the actual hash of form XObjects represented by that logical path.

This method simply loops across all the form XObjects in $xobjects and calls $pdf->validate_xobject() for each of them.

validate_xobject

$pdf->validate_xobject($path, $xobject);

Used by $pdf->validate_xobjects() to validate a form XObject. The $path parameter specifies the logical path from the root of the PDF::Data object to the form XObject, and the $xobject parameter specifies the actual form XObject represented by that logical path.

This method verifies that $xobject is a stream and $xobject->{Subtype} is "/Form", then calls $pdf->validate_content_stream() with $xobject to validate the form XObject content stream, then calls $pdf->validate_resources() to validate the form XObject's resources, if any.

validate_content_stream

$pdf->validate_content_stream($path, $stream);

Used by $pdf->validate_page() and $pdf->validate_xobject() to validate a content stream. The $path parameter specifies the logical path from the root of the PDF::Data object to the content stream, and the $stream parameter specifies the actual content stream represented by that logical path.

This method calls $pdf->parse_objects() to make sure that the content stream can be parsed. If the $pdf->{-minify} flag is set, $pdf->minify_content_stream() will be called with the array of parsed objects to minify the content stream.

minify_content_stream

$pdf->minify_content_stream($stream, $objects);

Used by $pdf->validate_content_stream() to minify a content stream. The $stream parameter specifies the content stream to be modified, and the optional $objects parameter specifies a reference to an array of parsed objects as returned by $pdf->parse_objects().

This method calls $pdf->parse_objects() to populate the $objects parameter if unspecified, then it calls $pdf->generate_content_stream() to generate a minimal content stream for the array of objects, with no comments and only the minimum amount of whitespace necessary to parse the content stream correctly. (Obviously, this means that this transformation is not reversible.)

Currently, this method also performs a sanity check by running the replacement content stream through $pdf->parse_objects() and comparing the entire list of objects returned against the original list of objects to ensure that the replacement content stream is equivalent to the original content stream.

generate_content_stream

my $data = $pdf->generate_content_stream($objects);

Used by $pdf->minify_content_stream() to generate a minimal content stream to replace the original content stream. The $objects parameter specifies a reference to an array of parsed objects as returned by $pdf->parse_objects(). These objects will be used to generate the new content stream.

For each object in the array, this method will call an appropriate serialization method: $pdf->serialize_dictionary() for dictionary objects, $pdf->serialize_array() for array objects, or $pdf->serialize_object() for other objects. After serializing all the objects, the newly-generated content stream data is returned.

serialize_dictionary

$pdf->serialize_dictionary($stream, $hash);

Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary() (recursively) and $pdf->serialize_array() to serialize a hash as a dictionary object. The $stream parameter specifies a reference to a string containing the data for the new content stream being generated, and the $hash parameter specifies the hash reference to be serialized.

This method will serialize all the key-value pairs of $hash, prefixing each key in the hash with "/" to serialize the key as a name object, and calling an appropriate serialization routine for each value in the hash: $pdf->serialize_dictionary() for dictionary objects (recursive call), $pdf->serialize_array() for array objects, or $pdf->serialize_object() for other objects.

serialize_array

$pdf->serialize_array($stream, $array);

Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary() and $pdf->serialize_array() (recursively) to serialize an array. The $stream parameter specifies a reference to a string containing the data for the new content stream being generated, and the $array parameter specifies the array reference to be serialized.

This method will serialize all the array elements of $array, calling an appropriate serialization routine for each element of the array: $pdf->serialize_dictionary() for dictionary objects, $pdf->serialize_array() for array objects (recursive call), or $pdf->serialize_object() for other objects.

serialize_object

$pdf->serialize_object($stream, $object);

Used by $pdf->generate_content_stream(), $pdf->serialize_dictionary() and $pdf->serialize_array() to serialize a simple object. The $stream parameter specifies a reference to a string containing the data for the new content stream being generated, and the $object parameter specifies the pre-serialized object to be serialized to the specified content stream data.

This method will strip leading and trailing whitespace from the pre-serialized object if the $pdf->{-minify} flag is set, then append a newline to ${$stream} if appending the pre-serialized object would exceed 255 characters for the last line, then append a space to ${$stream} if necessary to parse the object correctly, then append the pre-serialized object to ${$stream}.
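The append logic described above can be sketched as follows. The function name `append_token` is hypothetical, the sketch always trims whitespace rather than checking a flag, and the rule used to decide when a separating space is needed (neither neighbor is whitespace or a PDF delimiter character) is an assumption about what "necessary to parse the object correctly" means.

```perl
use strict;
use warnings;

# Hypothetical sketch: append one pre-serialized token to the stream data
# referenced by $stream, keeping every line at 255 characters or fewer.
sub append_token {
    my ($stream, $token) = @_;
    $token =~ s/^\s+|\s+$//g;

    # Length of the last (possibly only) line in the stream so far.
    my $line_len = length($$stream) - (rindex($$stream, "\n") + 1);

    # A separator is assumed necessary unless either neighbor is whitespace
    # or one of the PDF delimiter characters.
    my $needs_space = $line_len > 0
        && substr($$stream, -1) !~ /^[\s()<>\[\]{}\/%]$/
        && $token !~ /^[()<>\[\]{}\/%]/;

    if ($line_len > 0
        && $line_len + ($needs_space ? 1 : 0) + length($token) > 255)
    {
        $$stream .= "\n";   # start a new line instead of overflowing
    }
    elsif ($needs_space) {
        $$stream .= ' ';
    }
    $$stream .= $token;
    return;
}

my $data = '';
append_token(\$data, $_) for qw(q 1 0 0 1 10 20 cm /Im1 Do Q);
print "$data\n";   # prints "q 1 0 0 1 10 20 cm/Im1 Do Q"
```

No space is emitted before `/Im1` because "/" already terminates the preceding token.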

validate_key

$pdf->validate_key($hash, $key, $value, $label);

Used by $pdf->validate() to validate specific hash key values.

get_hash_node

my $hash = $pdf->get_hash_node($path);

Used by $pdf->validate_key() to get a hash node from the PDF structure by path.

parse_objects

my @objects = $pdf->parse_objects($objects, $data, $offset);

Used by $pdf->parse_pdf() to parse PDF objects into Perl representations.

parse_data

my @objects = $pdf->parse_data($data);

Uses $pdf->parse_objects() to parse PDF objects from standalone PDF data.

filter_stream

$pdf->filter_stream($stream);

Used by $pdf->parse_objects() to inflate compressed streams.

compress_stream

$new_stream = $pdf->compress_stream($stream);

Used by $pdf->write_object() to compress streams if enabled. This is controlled by the $pdf->{-compress} flag, which is set automatically when reading a PDF file with compressed streams, but must be set manually for PDF files created from scratch, either in the constructor arguments or after the fact.

resolve_references

$object = $pdf->resolve_references($objects, $object);

Used by $pdf->parse_pdf() to replace parsed indirect object references with direct references to the objects in question.

write_indirect_objects

my $xrefs = $pdf->write_indirect_objects($pdf_file_data, $objects, $seen);

Used by $pdf->write_pdf() to write all indirect objects to a string of new PDF file data.

enumerate_indirect_objects

$pdf->enumerate_indirect_objects($objects);

Used by $pdf->write_indirect_objects() to identify which objects in the PDF data structure need to be indirect objects.

enumerate_shared_objects

$pdf->enumerate_shared_objects($objects, $seen, $ancestors, $object);

Used by $pdf->enumerate_indirect_objects() to find objects which are already shared (referenced from multiple objects in the PDF data structure).
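The idea of detecting shared objects can be illustrated by counting how many times each reference is reached while walking a nested structure. This is a minimal sketch under the hypothetical name `find_shared`; it only follows hash and array references and treats a back-reference to an ancestor as sharing, which may differ from how the module handles ancestors.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical sketch: return every hash or array reference that is reached
# more than once from $root. Each object is descended into only once, so
# cyclic structures terminate.
sub find_shared {
    my ($root) = @_;
    my (%count, %object);
    my @queue = ($root);
    while (defined(my $node = shift @queue)) {
        my $addr = refaddr($node);
        $object{$addr} = $node;
        next if $count{$addr}++;   # seen before: count it, do not descend
        my @children = ref($node) eq 'HASH' ? values %$node : @$node;
        push @queue, grep { ref($_) eq 'HASH' || ref($_) eq 'ARRAY' } @children;
    }
    return map { $object{$_} } grep { $count{$_} > 1 } keys %count;
}

my $font = { Type => '/Font' };
my $doc  = { Pages => [ { Font => $font }, { Font => $font } ] };

my @shared = find_shared($doc);
print scalar(@shared), " shared object(s)\n";   # prints "1 shared object(s)"
```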

add_indirect_objects

$pdf->add_indirect_objects($objects, @objects);

Used by $pdf->enumerate_indirect_objects() and $pdf->enumerate_shared_objects() to add objects to the list of indirect objects to be written out.

write_object

$pdf->write_object($pdf_file_data, $objects, $seen, $object, $indent);

Used by $pdf->write_indirect_objects(), and called by itself recursively, to write direct objects out to the string of new PDF file data.

dump_object

my $output = $pdf->dump_object($object, $label, $seen, $indent, $mode);

Used by $pdf->dump_pdf(), and called by itself recursively, to dump (or outline) the specified PDF object.