Curious case of combining pdfs

Submitted by hemanth on Fri, 09/24/2010 - 18:36

The case:

The study material came as many separate chapter files: chapter1.pdf, chapter2.pdf ... chapterN.pdf. The URLs followed an easy-to-crack pattern, so it was easy to loop over them with wget. I just had to do something like:

url=myurl
# {1..100} (two dots) expands to chapter1 ... chapter100
for file in chapter{1..100}
do
    wget "$url/$file.pdf"
done

The bad cat and good cat:
As a GNU/Linux lover I always find cat handy. The issue is that a PDF, unlike PS, cannot be joined this way: cat 1.pdf 2.pdf > 12.pdf will just give us 2.pdf. The well-known, common tool for combining PDFs is pdftk.

Tour with pdftk:
pdftk is a handy tool for manipulating PDFs, but it was not so handy in this case until an easy workaround was figured out.

The issue:
After wgetting them all, the directory has files like:

chapter10.pdf  chapter15.pdf  chapter1.pdf   chapter24.pdf  chapter5.pdf
chapter11.pdf  chapter16.pdf  chapter20.pdf  chapter25.pdf  chapter6.pdf
chapter12.pdf  chapter17.pdf  chapter21.pdf  chapter2.pdf   chapter7.pdf
chapter13.pdf  chapter18.pdf  chapter22.pdf  chapter3.pdf   chapter8.pdf
chapter14.pdf  chapter19.pdf  chapter23.pdf  chapter4.pdf   chapter9.pdf

Clearly the ordering is not what is required, so

pdftk *.pdf cat output combined.pdf

would indeed mess up the chapter order!
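To see the ordering problem without downloading anything, here is a minimal sketch that creates empty stand-in files in a scratch directory and prints what the shell passes for chapter*.pdf. The function name list_glob_order and the three sample files are only illustrative, and LC_ALL=C is set so the result does not depend on the local collation rules.

```shell
#!/usr/bin/env bash
# Sketch: show the order in which the shell expands chapter*.pdf.
# The empty files below are stand-ins for the real downloads.
list_glob_order() {
    local dir
    dir=$(mktemp -d) || return 1
    (
        cd "$dir" || exit 1
        export LC_ALL=C          # plain byte-wise sorting, for reproducibility
        touch chapter1.pdf chapter2.pdf chapter10.pdf
        echo chapter*.pdf        # chapter10.pdf is listed before chapter2.pdf
    )
    rm -rf "$dir"
}

list_glob_order
```

pdftk would receive its input files in exactly this order, so chapter 10 would land ahead of chapter 2 in the combined file.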

The workaround:

Trial 1: Numbers padded with zeros.
An easy and straightforward way to fix this mess seemed to be padding the numbers with zeros:

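for name in c*.pdf
do
    # strip every non-digit character, leaving only the chapter number
    num=${name//[![:digit:]]}
    # zero-pad to three digits, e.g. chapter7.pdf becomes C007.pdf
    newname=$(printf "C%03i.pdf" "$num")
    echo mv "${name}" "${newname}"
    mv "${name}" "${newname}"
done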
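Before renaming anything it can help to preview the new names. The sketch below wraps the same digit-stripping and padding steps in a hypothetical helper, padded_name, which only prints the target name. It also adds one thing the loop above does not have: the 10# prefix forces base 10, on the assumption that a name such as chapter08.pdf might turn up, since bash would otherwise try to read 08 as an octal number.

```shell
#!/usr/bin/env bash
# Sketch: compute the zero-padded name without touching any file.
# padded_name is a hypothetical helper, not part of pdftk or the original loop.
padded_name() {
    local name=$1 num
    num=${name//[![:digit:]]/}           # keep only the digits
    # 10# forces base 10, so a leading zero is not read as octal
    printf 'C%03d.pdf\n' "$((10#$num))"
}

padded_name chapter7.pdf     # prints C007.pdf
padded_name chapter42.pdf    # prints C042.pdf
```

The helper assumes every name contains exactly one run of digits, which holds for the chapterN.pdf files here.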

And then do a

pdftk *.pdf cat output combined.pdf

This was indeed a roundabout and unnecessary way of resolving the issue! That was only realized in trial 2.

Trial 2: Bash globbing saves the day

Using globbing, it was very easy to reduce the whole exercise to a single line:

pdftk chapter[0-9].pdf chapter[1-9][0-9].pdf cat output mixed.pdf
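The two patterns above cover chapters 0 to 99. If a chapter100.pdf exists as well, a third pattern is needed. The following is a rough sketch of that extension, not a definitive implementation: ordered_chapters is a made-up helper name, and nullglob is switched on (inside a subshell) so that a pattern with no matching file simply disappears instead of being passed on literally.

```shell
#!/usr/bin/env bash
# Sketch: list chapter files in numeric order, one glob per digit count.
# Runs in a subshell so that nullglob does not leak into the caller.
ordered_chapters() (
    shopt -s nullglob   # a pattern with no match expands to nothing
    files=(chapter[0-9].pdf chapter[1-9][0-9].pdf chapter[1-9][0-9][0-9].pdf)
    if (( ${#files[@]} )); then
        printf '%s\n' "${files[@]}"
    fi
)

# Demo in a scratch directory with empty stand-in files.
(
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch chapter1.pdf chapter2.pdf chapter10.pdf chapter100.pdf
    ordered_chapters
    cd / && rm -rf "$dir"
)
```

Since these names contain no spaces, the list could then be handed over with something like pdftk $(ordered_chapters) cat output combined.pdf.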

A deeper look into trial 2
c[0-9].pdf expands to the existing files that match that pattern, and the patterns are expanded in the order in which they are written. Something like c{1..10}, on the other hand, simply generates 10 words; it does not try to match them against filenames.

To make it clearer, here is an example:

touch c1.pdf c4.pdf c12.pdf
echo c[0-9].pdf c[1-9][0-9].pdf; echo c*.pdf

output: c1.pdf c4.pdf c12.pdf
output: c1.pdf c12.pdf c4.pdf
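The brace side of the comparison can be checked in the same way. This small sketch (compare_expansions is just an illustrative name) creates the same three files in a scratch directory and prints both expansions next to each other.

```shell
#!/usr/bin/env bash
# Sketch: globbing matches existing files, brace expansion only generates words.
compare_expansions() (
    dir=$(mktemp -d) || exit 1
    cd "$dir" || exit 1
    touch c1.pdf c4.pdf c12.pdf
    echo glob: c[0-9].pdf c[1-9][0-9].pdf   # only files that exist
    echo brace: c{1..5}.pdf                 # five words, existing or not
    cd / && rm -rf "$dir"
)

compare_expansions
# glob: c1.pdf c4.pdf c12.pdf
# brace: c1.pdf c2.pdf c3.pdf c4.pdf c5.pdf
```

Handing the brace version to pdftk would make it look for c2.pdf, c3.pdf and c5.pdf, which are not there.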

So this ends the case of combining PDFs. Please do share your experiences below!
