One of the nice side effects of running a PDF through pdftk is that the resulting PDF code is tidied up. Not only does this make the code easier to navigate in a text editor, but it also makes the PDF easier to process using traditional text tools such as grep.
For example, I had a PDF today that included page data (/ArtBox, /BleedBox, and /TrimBox) that was confusing Photoshop 6. When I went to rasterize one of these PDF pages, Photoshop would ignore the page cropping unless I removed this extra data from the PDF. I could do this manually in emacs, but I realized this was a job better suited for grep.
Pdftk creates tidy PDF code. First, it removes old artifacts from any incremental updates to the PDF (PDF Ref. 1.5 sec. 2.2.7). In Acrobat, this kind of cleanup is accomplished by performing a "Save As..." instead of just a "Save." Then, pdftk organizes PDF dictionaries by placing one key/value pair per line. Doing this is the first step for making my troublesome PDF grepable, because grep processes text line-by-line. You can tidy up PDF code by simply passing it through pdftk like so:
pdftk mydoc.pdf output mydoc.1.pdf
Or, if you plan to tinker with the PDF page streams, you might prefer:
pdftk mydoc.pdf output mydoc.1.pdf uncompress
The output will be ready to grep. Note that pdftk won't organize uncompressed page stream data for you. Even so, you will probably find it pretty tidy.
Grep is a command-line program that searches input files for a given pattern. It outputs the lines that match the pattern, and drops those that do not. Or, as in our case, we want it to do the inverse: drop lines that match our pattern. A simple command-line switch inverts the matching logic for us. Grep is free software.
Most non-Windows systems come with grep. Windows users can download grep from the GnuWin32 Project. Or, you can install an entire Linux-like environment (including grep and bash) on your Windows machine with MSYS.
Alright, so you used pdftk to normalize your PDF code, and you have grep handy. Now let's strip that pesky data out of the PDF by running:
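grep -a -v -E '^/[ABT][a-z]+Box' mydoc.1.pdf > mydoc.2.pdf
The -a tells grep to treat the PDF as text even though it contains binary data, and the -v tells grep to output lines that do not match the given pattern. The -E selects extended regular expression syntax, so that + means "one or more of the preceding item"; without it, grep takes the + literally and the pattern matches nothing. The '^/[ABT][a-z]+Box' part is the pattern to match, and it is written in regular expression syntax (or "regex" for short). Recall we want to omit /ArtBox, /BleedBox, and /TrimBox dictionary entries (but we want to keep any /BBox entries). Knowing that, you can kind of see how this regex works: ^ anchors the match to the start of the line, [ABT] matches the first letter of the key, [a-z]+ matches the lowercase letters that follow, and Box matches the end of the key. /BBox survives because no lowercase letter sits between its first B and Box.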
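To watch the pattern work without a real PDF, here is a minimal sketch that runs it against a few hand-written dictionary lines. The sample lines are made up for illustration; they only imitate the one-key-per-line layout described above. The sketch passes -E so that + is read as "one or more".

```shell
# Hand-written sample lines (an assumption for illustration, not real pdftk output).
sample=$(mktemp)
cat > "$sample" <<'EOF'
/Type /Page
/MediaBox [0 0 612 792]
/ArtBox [10 10 602 782]
/BleedBox [0 0 612 792]
/TrimBox [10 10 602 782]
/BBox [0 0 100 100]
EOF

# -a: treat input as text; -v: keep lines that do NOT match;
# -E: extended regex, so + means "one or more".
kept=$(grep -a -v -E '^/[ABT][a-z]+Box' "$sample")
printf '%s\n' "$kept"

rm -f "$sample"
```

Only the /Type, /MediaBox, and /BBox lines should come out the other end.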
If grep removed material from the PDF, then the PDF probably has a broken XREF table now. Running it through pdftk will fix it up, and this also gives you a chance to re-compress its page streams. E.g.:
pdftk mydoc.2.pdf output mydoc.3.pdf
Or:
pdftk mydoc.2.pdf output mydoc.3.pdf compress
These three steps can be strung into a single command line by using pipes like so:
pdftk mydoc.pdf output - | grep -a -v -E '^/[ABT][a-z]+Box' | pdftk - output mydoc.new.pdf
Lovely!
This technique is good for all sorts of things. For example, vary the work above to discover the page sizes used in your PDF. This next pattern will yield all of the /MediaBox and /CropBox entries from mydoc.pdf:
pdftk mydoc.pdf output - | grep -a -E '^/[MC][a-z]+Box' > mydoc.page_boxes.txt
Note that the output from these commands won't tell you which pages they are from, since page ordering in the PDF code can be arbitrary.
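If all you want is the set of distinct page boxes, one option is to pipe the matches through sort -u. The sketch below tries that on hand-written sample lines (made up for illustration, not taken from a real PDF); LC_ALL=C is set only to keep the sort order predictable.

```shell
# Hand-written sample lines standing in for tidied PDF code (an assumption).
sample=$(mktemp)
cat > "$sample" <<'EOF'
/Type /Page
/MediaBox [0 0 612 792]
/CropBox [0 0 612 792]
/Type /Page
/MediaBox [0 0 612 792]
/Type /Page
/MediaBox [0 0 595 842]
EOF

# Keep only the /MediaBox and /CropBox lines, then collapse duplicates.
boxes=$(grep -a -E '^/[MC][a-z]+Box' "$sample" | LC_ALL=C sort -u)
printf '%s\n' "$boxes"

rm -f "$sample"
```

Against a real document, the same sort -u could be appended to the pdftk and grep pipeline in place of the sample file.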
If you are having fun grepping PDF, you should also look into sed and gawk.
Sed is the stream editor, and it is more powerful than grep.
I just had a chance to play with sed on Windows. The sed that comes with MSYS does work, but you must call it from the MSYS bash shell (I used the MS command shell during initial testing; I don't know why) and you must use the -c option. The -c option turns off line-ending (\r\n) translation, and it must be the first switch on the command line.
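Here is a sed example based on a question from Maarten. It replaces all [0 0 612 792] MediaBox arrays with [0 0 500 500]. The square brackets in the search pattern are escaped with backslashes so that sed matches them literally instead of reading them as a character class. The -c option is necessary on Windows; don't use it on Linux:
sed -c '/MediaBox/s/\[0 0 612 792\]/[0 0 500 500]/g' mydoc.1.pdf > mydoc.2.pdf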
Or, maybe you have a variety of MediaBox arrays and you want to replace them all with [0 0 500 500]. In that case, use a regex that matches any array of digits, decimal points, and spaces instead of the literal [0 0 612 792]:
sed -c '/MediaBox/s/\[[0-9. ]*\]/[0 0 500 500]/g' mydoc.1.pdf > mydoc.2.pdf
In practice, you might also need to address the /CropBox page entries. And don't forget to pass your PDF through pdftk before and after using sed, as we did with grep, above.
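As a rough sketch of handling both page entries at once, the following runs two sed expressions over hand-written sample lines (made up for illustration). It assumes a Linux-style sed, so the -c option is left out, and it escapes the square brackets in the search pattern so they match literally.

```shell
# Hand-written sample lines standing in for tidied PDF code (an assumption).
sample=$(mktemp)
cat > "$sample" <<'EOF'
/Type /Page
/MediaBox [0 0 612 792]
/CropBox [10 10 602 782]
/BBox [0 0 100 100]
EOF

# One expression per key: on lines that start with the key, replace the
# bracketed array of digits, dots, and spaces with the new box.
resized=$(sed -e '/^\/MediaBox/s/\[[0-9. ]*\]/[0 0 500 500]/' \
              -e '/^\/CropBox/s/\[[0-9. ]*\]/[0 0 500 500]/' "$sample")
printf '%s\n' "$resized"

rm -f "$sample"
```

The /BBox line is left alone because neither address matches it.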