Extracting metadata from multiple PDF documents

Adobe Acrobat allows you to add metadata to your PDFs, and the values you assign to these tags will be available to search engines and other services that rely on metadata. For example, you could add a tag called "author" and assign the name of the document's author to it. Acrobat also inserts system metadata values, such as the document creation date.

But let's say you have a bunch of PDF documents, and you need a quick way to see what metadata they contain. Wouldn't it be great if you could generate a report?

You can!

In fact, below I'll provide a script that will quickly generate a multi-column report of all metadata in an entire folder of PDFs, including all subfolders. But first, a bit of background on how this works.

How it works

While PDF is largely a binary format, the metadata is usually stored as plain text within the file. That means you can open a PDF in Notepad and see the metadata values. If you scroll past all of the strange characters, you eventually see some XML (Adobe's XMP format) that contains both the custom metadata values you've defined and any system-defined values. They look like this...

  <xmp:ModifyDate>2012-11-14T14:41:58-05:00</xmp:ModifyDate>
  <pdfx:myAuthorName>Craig</pdfx:myAuthorName>

The fact that these are easily visible in a text editor means they are ripe for extraction! Any script capable of extracting information from a text file can be used to create a report from the values. Personally, I'm a huge fan of Perl. Perl's regular expression functions make it indispensable for technical writers who need to automate document tasks or create reports from text data. For more of my thoughts on how Perl can benefit technical writers, look here.
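To make the idea concrete, here is a minimal sketch of the extraction step on its own. It runs against a short sample string copied from the lines above rather than a real PDF, and the variable names ($contents, $author, $modified) are just illustrative choices.

```perl
use strict;
use warnings;

# Sample text standing in for the XMP portion of a PDF (copied from the example above).
my $contents = <<'XMP';
  <xmp:ModifyDate>2012-11-14T14:41:58-05:00</xmp:ModifyDate>
  <pdfx:myAuthorName>Craig</pdfx:myAuthorName>
XMP

# A match in list context returns the captured groups, so each variable
# receives whatever sat between the opening and closing tags.
my ($author)   = $contents =~ m{<pdfx:myAuthorName>(.+?)</pdfx:myAuthorName>}s;
my ($modified) = $contents =~ m{<xmp:ModifyDate>(.+?)T}s;   # date only, stops at the "T"

print "Author: $author\n";
print "Modified: $modified\n";
```

The non-greedy (.+?) keeps each capture as short as possible, which is why the date match stops at the first "T" instead of running on into the time zone.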

The code

Below is the Perl script I would use to create a report of metadata values from a folder full of PDF documents on my Windows PC. (If you don't have Perl installed, I recommend ActiveState Perl. It's a free download and is easy to set up.)

After installing Perl, paste the following text into Notepad and save it as "extract_pdf_metadata.pl". Update "metadata1" and "metadata2" throughout the code with the names of your custom metadata tags. To report on additional custom tags, copy the lines that mention "metadata1" and update the copies with the new tag names.

Instructions for running the script from a Windows command prompt are in the comments at the top of the .pl code.

Here's the code...


#!/usr/bin/perl -w
######################################################################################
# EXTRACT_PDF_METADATA.PL
#------------------------------------------------------------------------------------
# PURPOSE: Extract metadata from PDFs and generate a report with columns.
#
# USAGE: perl extract_pdf_metadata.pl [path] > [report]
# EXAMPLE: perl extract_pdf_metadata.pl c:\pdfs > c:\pdfs\pdf_metadata.txt
# ARGUMENTS PASSED TO SCRIPT:
# 0. [path] is the folder containing your PDFs (subfolders are searched too).
#
# The report is printed to standard output; "> [report]" redirects it to a file
# (you can pick any path or filename you wish).
######################################################################################

use File::Find;   # recursive directory traversal

#######################################
# Get list of PDF files in directory
#######################################

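my $pdfdir = $ARGV[0];
die "Usage: perl extract_pdf_metadata.pl [path]\n" unless defined $pdfdir;

my @filesarray;
# find() visits every file under $pdfdir, including subfolders;
# the /i flag also picks up files named ".PDF"
find(sub { push @filesarray, $File::Find::name if /\.pdf$/i }, $pdfdir);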


###########################################
# Print some column headings for the output
###########################################

print "Filename\t";
print "metadata1\t";
print "metadata2\t";
print "Date modified\t";
print "Date created\n";

 
############################################
# Loop through files array and process them
############################################

my $metadata1;
my $metadata2;
my $datemodified;
my $datecreated;

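foreach my $file (@filesarray) {

    # Read the whole file into $contents in one go
    my $contents;
    if (open(INFILE, "<", $file)) {
        binmode(INFILE);
        local $/;                 # no record separator, so one read slurps the file
        $contents = <INFILE>;
        close(INFILE) || die("Error, could not close file after reading. $!");
    } else {
        print STDERR "Error, could not read file $file. $!\n";
        next;
    }

    # Pull out the custom tags and the date portion (up to the "T") of the system dates
    ($metadata1)    = $contents =~ m/pdfx:metadata1>(.+?)<\/pdfx:metadata1/si;
    ($metadata2)    = $contents =~ m/pdfx:metadata2>(.+?)<\/pdfx:metadata2/si;
    ($datemodified) = $contents =~ /xmp:ModifyDate>(.+?)T/si;
    ($datecreated)  = $contents =~ /xmp:CreateDate>(.+?)T/si;

    # Print one tab-separated row per file; missing values become empty columns
    print join("\t", map { defined $_ ? $_ : "" }
        $file, $metadata1, $metadata2, $datemodified, $datecreated), "\n";
}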

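If you follow the instructions, you'll get a file called "pdf_metadata.txt" with one row per PDF in the folder you specified, showing the metadata tags you listed in the script plus the modified and created dates. Because the report is tab-delimited text, you can also open it in Excel. Redirecting the output to a file with an ".xls" extension usually works too, although the file is still plain text and Excel may warn that the format doesn't match the extension.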
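If you track more than a couple of custom tags, copying lines for each one gets tedious. As a rough sketch of one alternative (not part of the script above), you could keep the tag names in a list and loop over them. The extract_tags helper, the @tags list, and the sample string are all hypothetical names made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical tag names; replace these with your own custom metadata tags.
my @tags = ('metadata1', 'metadata2');

# Return one value per tag name, in order; a missing tag yields an empty string.
sub extract_tags {
    my ($contents, @names) = @_;
    my @values;
    foreach my $name (@names) {
        # \Q...\E quotes any regex metacharacters in the tag name.
        my ($value) = $contents =~ m{<pdfx:\Q$name\E>(.+?)</pdfx:\Q$name\E>}si;
        push @values, defined $value ? $value : '';
    }
    return @values;
}

# Sample text standing in for the contents of one PDF.
my $sample = '<pdfx:metadata1>Craig</pdfx:metadata1><pdfx:metadata3>x</pdfx:metadata3>';

# Build one tab-separated report row: filename first, then one column per tag.
my $row = join("\t", 'sample.pdf', extract_tags($sample, @tags));
print "$row\n";
```

With this approach, adding a column means adding one name to @tags, and the heading row can be built from the same list with join("\t", 'Filename', @tags).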

Enjoy!