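Not so much an issue as a note: recent versions of Tesseract (3.03 and later) can recognize text and write the result directly as a searchable PDF, with the recognized text as an invisible overlay on the page image:

```
tesseract infile.tif outfile pdf
```

This writes `outfile.pdf` and saves mucking about with hOCR or any other intermediate text format.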
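
If you have a whole directory of scans, the same command can be wrapped in a small loop. This is only a rough sketch: the function name `ocr_dir_to_pdf` and the `TESSERACT_BIN` variable are made up for illustration (the variable just lets you point at a different binary), and it assumes the inputs all end in `.tif`.

```shell
# Hypothetical helper: run "tesseract <in> <outbase> pdf" on every .tif in a directory.
# TESSERACT_BIN is an assumed override hook for this sketch, not a Tesseract feature.
ocr_dir_to_pdf() {
    dir=${1:-.}
    for infile in "$dir"/*.tif; do
        # Skip the unexpanded pattern when no .tif files match.
        [ -e "$infile" ] || continue
        # Tesseract appends ".pdf" to the output base, so strip ".tif" here.
        "${TESSERACT_BIN:-tesseract}" "$infile" "${infile%.tif}" pdf
    done
}
```

Calling `ocr_dir_to_pdf scans` should leave a `page1.pdf` next to each `page1.tif` in `scans`, assuming Tesseract is on your `PATH`.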