Windows Wildcards Can Yield Extra Pages

Tuesday, June 21 2005 @ 10:27 AM PDT

Contributed by: Admin

Some pdftk users process hundreds of files at a time. Doing this work on a Windows machine can yield unexpected results. The cause is the Windows command-prompt shell, not pdftk: for every long filename, Windows also creates a short, DOS-compatible (8.3) filename. This short filename can match a wildcard expression even when the long filename does not. When using pdftk, the result is that you end up with more input files than you wanted.

This article offers a couple of workarounds and then describes the case where this problem arose.

The Workarounds

One workaround is to use a wildcard expression that couldn't possibly match a short, DOS-style filename. DOS-style filenames have a base name of at most eight characters and an optional extension of at most three characters. They look something like this: 343990~1.PDF. In the case below, using the wildcard expression 343990_* solved the problem, because the short names have a tilde (~) where this expression requires an underscore.
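To make the reasoning concrete, here is a minimal Python sketch that models the matching. It assumes that a file is selected when either its long name or its short name fits the pattern, and it uses Python's fnmatch module as a stand-in for the Windows matcher, which is only an approximation for simple patterns like these. The function name windows_style_matches is hypothetical; the name pairs are copied from the dir /X listing later in this article.

```python
from fnmatch import fnmatchcase

# (short 8.3 name, long name) pairs copied from the dir /X listing in this article.
ENTRIES = [
    ("343990~1.PDF", "343990_0011.pdf"),
    ("343990~2.PDF", "343990_0021.pdf"),
    ("343990~3.PDF", "343990_0031.pdf"),
    ("343990~4.PDF", "343990_0041.pdf"),
    ("343990~5.PDF", "345089_0131.pdf"),
    ("343990~6.PDF", "345688_1121.pdf"),
]


def windows_style_matches(pattern, entries):
    """Return the long names selected when EITHER name fits the pattern.

    This is a simplified model of the behavior described in the article,
    not the real Windows matching algorithm. Names are upper-cased to
    imitate case-insensitive matching.
    """
    pat = pattern.upper()
    return [
        long_name
        for short_name, long_name in entries
        if fnmatchcase(long_name.upper(), pat) or fnmatchcase(short_name.upper(), pat)
    ]


if __name__ == "__main__":
    # The loose pattern also picks up files through their short names.
    print(windows_style_matches("343990*", ENTRIES))
    # The underscore cannot match the tilde in a short name.
    print(windows_style_matches("343990_*", ENTRIES))
```

Under this model, the first pattern selects all six files, while the second selects only the four files whose long names begin with 343990_.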

Another workaround is to use a shell other than the Windows command prompt. I use bash as packaged by MSYS.
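In the same spirit, the wildcard can be expanded by a small script instead of by the command prompt. The Python sketch below is an illustration, not something the original case used. It assumes that os.listdir reports only long filenames, so short names never take part in the match. The helper names long_name_matches and build_pdftk_command are hypothetical; the pdftk argument order follows the command shown in this article.

```python
import fnmatch
import os


def long_name_matches(directory, pattern):
    """Return the sorted names in `directory` whose long name fits `pattern`.

    Matching is done on lower-cased names to imitate the case-insensitive
    behavior of Windows filenames.
    """
    pat = pattern.lower()
    return sorted(
        name for name in os.listdir(directory)
        if fnmatch.fnmatchcase(name.lower(), pat)
    )


def build_pdftk_command(directory, pattern, output_name):
    """Build an explicit pdftk argument list: inputs, then 'cat output <file>'."""
    inputs = [
        os.path.join(directory, name)
        for name in long_name_matches(directory, pattern)
        # Skip a previous output file so it is never fed back in as input.
        if name.lower() != output_name.lower()
    ]
    return ["pdftk", *inputs, "cat", "output", os.path.join(directory, output_name)]


if __name__ == "__main__":
    # Print the command instead of running it; the list could be passed
    # to subprocess.run(cmd, check=True) once it looks right.
    cmd = build_pdftk_command(".", "343990*", "343990.PDF")
    print(cmd)
```

Because the script passes explicit filenames, pdftk receives only the files whose long names match, in sorted order.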

The Case

This problem arose in a case where a directory of input files contained 448 PDFs. Their numerical names had incrementing prefixes and suffixes, such as:

343959_0011.pdf
343959_0021.pdf
343959_0031.pdf
343990_0011.pdf
343990_0021.pdf
343990_0031.pdf
343990_0041.pdf
343991_0011.pdf
343991_0021.pdf
343991_0031.pdf
343992_0011.pdf
343992_0021.pdf
343992_0031.pdf
343993_0011.pdf
343993_0021.pdf
343993_0031.pdf
343994_0011.pdf
343994_0021.pdf
343994_0031.pdf
...

When pdftk was used to combine these PDF files, extra files showed up in the output PDF. For example, running:

pdftk 343990* cat output 343990.PDF

yields 343990.PDF, which includes these files in this order:

343990_0011.pdf
343990_0021.pdf
343990_0031.pdf
343990_0041.pdf
345089_0131.pdf
345688_1121.pdf 

Is this a pdftk error or a shell error? Using dir shows that the shell is passing these unwanted files to pdftk:

dir 343990*

06/20/2005  03:58p               1,825 343990_0011.pdf
06/20/2005  03:58p               1,825 343990_0021.pdf
06/20/2005  03:58p               1,825 343990_0031.pdf
06/20/2005  03:58p               1,825 343990_0041.pdf
06/20/2005  03:58p               1,828 345089_0131.pdf
06/20/2005  03:58p               1,828 345688_1121.pdf

The mystery is solved by adding the /X switch to dir. This switch shows the DOS-compatible name on the left and the original, long filename on the right. The two unwanted files have short names that begin with 343990, so they match the wildcard:

dir /X 343990*

06/20/2005  03:58p               1,825 343990~1.PDF    343990_0011.pdf
06/20/2005  03:58p               1,825 343990~2.PDF    343990_0021.pdf
06/20/2005  03:58p               1,825 343990~3.PDF    343990_0031.pdf
06/20/2005  03:58p               1,825 343990~4.PDF    343990_0041.pdf
06/20/2005  03:58p               1,828 343990~5.PDF    345089_0131.pdf
06/20/2005  03:58p               1,828 343990~6.PDF    345688_1121.pdf 

Thanks to Josh Gray at Daktronics, who identified this problem and worked with me to solve it.


http://www.accesspdf.com/article.php/20050621102725584