Post-processing: Automated OCR and PDF creation
Prerequisites
Software
UNIX shell, e.g. Windows Subsystem for Linux with Ubuntu (tested with WSL/Ubuntu 22.04 LTS and Debian 11 on the tape server)
Tesseract OCR (tested with version 4.1.x)
MuPDF tools (tested with version 1.17)
Ghostscript (tested with version 9.53)
Make sure the installed versions of Tesseract, MuPDF and Ghostscript are up to date. Run the following command to update the software components of your Linux installation:
sudo apt update && sudo apt upgrade
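Before updating, it can help to confirm that all three tools are installed and to note their versions. The following is a minimal sketch: the helper name `missing_tools` is hypothetical, and it assumes the usual executable names `tesseract`, `mutool` (MuPDF tools) and `gs` (Ghostscript).

```shell
# Hypothetical helper: print the name of every given command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for Tesseract, MuPDF tools and Ghostscript.
missing=$(missing_tools tesseract mutool gs)
if [ -n "$missing" ]; then
    echo "Not installed or not on PATH:" $missing
else
    tesseract --version 2>&1 | head -n 1   # first line holds the Tesseract version
    gs --version                           # Ghostscript prints its version number
    mutool -v                              # MuPDF tools version
fi
```

On Debian and Ubuntu these tools are typically provided by the packages `tesseract-ocr`, `mupdf-tools` and `ghostscript`.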
Input files and staging space
This process presumes that the image files used to run OCR will be cropped and QC'd TIF files, named in sequential order. Avoid using this method on JPEG files, as this could result in significant loss of image quality.
Page images need to be staged in one subdirectory per volume/issue.
Ensure that the staging area used to process the images has enough space available. The process will create intermediary files (page-level PDFs and an uncompressed/unoptimized aggregate PDF document), which can take up hundreds of megabytes.
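A quick way to check the available space before starting is sketched below. The helper name `free_kb` and the 1 GB threshold are assumptions for illustration; pick a threshold that suits the size of the volume being processed.

```shell
# Hypothetical helper: free space, in kilobytes, on the filesystem holding a directory.
free_kb() {
    # -P forces one line per filesystem, -k reports kilobytes; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Assumed safety margin of 1 GB (1048576 KB).
required_kb=1048576
available_kb=$(free_kb .)
if [ "$available_kb" -lt "$required_kb" ]; then
    echo "Only ${available_kb} KB free in the staging area; consider freeing space first."
fi
```

Run it from inside the staging directory so that `.` refers to the filesystem that will hold the intermediary files.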
Processing instructions
Note: the procedure does not yet work as described here. Instructions for placing the PDF/A definition file and the ICC profile files are still missing; see "High Level Devices" in the Ghostscript 10.05.0 documentation.
The path used in the Ghostscript step needs to be valid for the OS/computer from which you are running the command. The PostScript (PS) file that the path points to must be edited so that it contains a valid path to an ICC profile.
Navigate to the staging directory containing the Production Master (_pm) image files
Create temporary subdirectories 'pdf' and 'pdf/pages', and navigate back to the staging directory:
mkdir pdf
cd pdf
mkdir pages
cd ..
Run tesseract OCR on all images in the staging directory:
for i in *.tif; do tesseract -c tessedit_page_number=0 -l eng "$i" "pdf/pages/${i%_pm.tif}" pdf; done
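The loop relies on shell parameter expansion to name each output file: `${i%_pm.tif}` removes the `_pm.tif` suffix from the image file name, and Tesseract appends `.pdf` to the output base when the `pdf` config is given. A small illustration, using a hypothetical file name:

```shell
# Hypothetical file name following the _pm naming convention.
i="vol01_0001_pm.tif"

# ${i%_pm.tif} strips the shortest trailing match of "_pm.tif".
outputbase="pdf/pages/${i%_pm.tif}"

echo "$outputbase"        # prints pdf/pages/vol01_0001
echo "${outputbase}.pdf"  # file Tesseract would write: pdf/pages/vol01_0001.pdf
```

Quoting the variables in the loop matters when a file name contains spaces, which would otherwise be split into several arguments.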
Merge page-level PDF documents into one PDF per volume/issue:
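One way to do this with the MuPDF tools listed under Software is `mutool merge`. The following is a minimal sketch, not a tested recipe: the helper name `merge_pages` and the `MUTOOL` override are hypothetical, and the output name `combined-pdf.pdf` is assumed from the cleanup step at the end of this page.

```shell
# Hypothetical helper: merge every PDF in a directory into a single output file.
# MUTOOL may be set to override the executable (an assumption, useful for dry runs).
merge_pages() {
    pages_dir=$1
    output=$2
    # The glob expands in sorted order, so zero-padded sequential names keep page order.
    "${MUTOOL:-mutool}" merge -o "$output" "$pages_dir"/*.pdf
}

# Run from the staging directory once the page-level PDFs exist.
if [ -d pdf/pages ]; then
    merge_pages pdf/pages combined-pdf.pdf
fi
```

Setting `MUTOOL=echo` before calling `merge_pages` prints the command that would be run, which is a convenient way to check the page order first.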
Produce the final PDF document.
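The Ghostscript documentation describes PDF/A creation with the `pdfwrite` device and a PostScript definition file (`PDFA_def.ps`) that references an ICC profile. The following is a minimal sketch along those lines, not a verified recipe: the helper name `make_pdfa`, the `GS` override, the PDF/A level (`-dPDFA=2`), the RGB color conversion strategy and all file names are assumptions, and every path must be replaced with one that is valid on your system.

```shell
# Hypothetical helper: convert the merged PDF to PDF/A with Ghostscript.
# $1 = edited PDFA_def.ps, $2 = ICC profile it points to, $3 = input PDF, $4 = output PDF
make_pdfa() {
    # Recent Ghostscript versions run in SAFER mode by default, so the ICC
    # profile is explicitly allowed for reading with --permit-file-read.
    "${GS:-gs}" -dPDFA=2 -dBATCH -dNOPAUSE \
        -sDEVICE=pdfwrite \
        -sColorConversionStrategy=RGB \
        -dPDFACompatibilityPolicy=1 \
        --permit-file-read="$2" \
        -sOutputFile="$4" \
        "$1" "$3"
}

# Example call (all paths are placeholders):
# make_pdfa /path/to/PDFA_def.ps /path/to/profile.icc combined-pdf.pdf final.pdf
```

Setting `GS=echo` before the call prints the assembled command line instead of running Ghostscript, which helps when checking the paths.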
Delete the temporary pdf subdirectory and the intermediate combined-pdf.pdf file.
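The cleanup can be sketched as a small helper. The name `cleanup_staging` is hypothetical, and the sketch assumes that combined-pdf.pdf sits in the staging directory next to the pdf subdirectory.

```shell
# Hypothetical helper: remove the intermediate files from a staging directory.
# Assumes combined-pdf.pdf sits next to the pdf subdirectory.
cleanup_staging() {
    rm -rf -- "$1/pdf"               # page-level PDFs and their parent directory
    rm -f -- "$1/combined-pdf.pdf"   # unoptimized aggregate PDF
}
```

Call it as `cleanup_staging .` from the staging directory, and only after confirming that the final PDF opens correctly, since the page-level PDFs cannot be recovered without rerunning OCR.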