Prerequisites

Software

UNIX shell, e.g. Windows Subsystem for Linux with Ubuntu (tested with WSL/Ubuntu 22.04 LTS and Debian 11 on the tape server)
Tesseract OCR (tested with version 4.1.x)
MuPDF tools (tested with version 1.17)
Ghostscript (tested with version 9.53)
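
To compare an installation against the tested versions listed above, the installed versions can be printed first. The following is a minimal sketch, not part of the original procedure; the helper name report_version is hypothetical, and the version flags are the ones each tool is commonly invoked with.

```bash
#!/usr/bin/env bash
# Print the first line of a tool's version output, or note that it is missing.
# report_version is a hypothetical helper introduced for this sketch.
report_version() {
    local tool=$1 flag=$2
    if command -v "$tool" >/dev/null 2>&1; then
        # Some tools print their version on stderr, so merge both streams.
        printf '%s: %s\n' "$tool" "$("$tool" "$flag" 2>&1 | head -n 1)"
    else
        printf '%s: not found\n' "$tool"
    fi
}

report_version tesseract --version
report_version mutool -v
report_version gs --version
```

A tool reported as "not found" is either not installed or not on the PATH of the current shell.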

Make sure the installed versions of Tesseract, MuPDF and Ghostscript are up to date. Run the following command to update the software components of your Linux installation:

...

Warning
This procedure does not yet work as described: instructions for where to place the PDF/A definition file and the ICC profile files are still missing.

Navigate to the staging directory containing the Production Master (_pm) image files
Create temporary subdirectories 'pdf' and 'pdf/pages', and navigate back to the staging directory:
Code Block
language bash
mkdir pdf
cd pdf
mkdir pages
cd ..
Run tesseract OCR on all images in the staging directory:
Code Block
language bash
for i in *.tif; do tesseract -c tessedit_page_number=0 -l eng "$i" "pdf/pages/${i%_pm.tif}" pdf; done
Merge page-level PDF documents into one PDF per volume/issue:
Code Block
language bash
mutool merge -o pdf/combined-pdf.pdf pdf/pages/*.pdf
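
The OCR loop above derives each page-level PDF name by stripping the "_pm.tif" suffix from the image name. The sketch below isolates that step and prints the Tesseract commands as a dry run instead of executing them. The helper name output_base and the example file name vol01_0001_pm.tif are hypothetical and only serve as illustration.

```bash
#!/usr/bin/env bash
# Derive the Tesseract output base for one Production Master image.
# "${1%_pm.tif}" removes a trailing "_pm.tif"; Tesseract appends ".pdf" itself.
# output_base is a hypothetical helper introduced for this sketch.
output_base() {
    printf 'pdf/pages/%s\n' "${1%_pm.tif}"
}

# Dry run: print one command per image for review, without running Tesseract.
for i in *_pm.tif; do
    [ -e "$i" ] || continue   # the pattern matched no files
    printf 'tesseract -c tessedit_page_number=0 -l eng %s %s pdf\n' \
        "$i" "$(output_base "$i")"
done
```

For example, output_base vol01_0001_pm.tif yields pdf/pages/vol01_0001. Because mutool merge receives pdf/pages/*.pdf in the shell's sorted glob order, the page order of the merged PDF follows the file names, so page numbers in the names need to be zero-padded to sort correctly.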

Produce the final PDF document:
Code Block
language bash
gs -sDEVICE=pdfwrite \
    -dPDFA=2 \
    -dPDFACompatibilityPolicy=1 \
    -dNOSAFER \
    -dFastWebView \
    -sColorConversionStrategy=RGB \
    -dDownsampleColorImages=true \
    -dColorImageDownsampleThreshold=1.0 \
    -dAutoRotatePages=/None \
    -dColorImageResolution=150 \
    -o downsampled-pdf.pdf \
    /mnt/dps/staff_workspaces/mirko/pdfa/PDFA_def_UTL.ps \
    pdf/combined-pdf.pdf
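
Since the Ghostscript step depends on files outside the staging directory, it may help to confirm that its inputs are readable before running it. The following is a minimal sketch under stated assumptions: the helper name require_readable is hypothetical, the definition file path is the one used in the command above, and the merged PDF is assumed to be where mutool merge wrote it (pdf/combined-pdf.pdf).

```bash
#!/usr/bin/env bash
# Return 0 only if every file passed as an argument exists and is readable.
# require_readable is a hypothetical helper introduced for this sketch.
require_readable() {
    local f status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            printf 'missing or unreadable: %s\n' "$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Paths taken from the steps above; adjust them if your layout differs.
pdfa_def=/mnt/dps/staff_workspaces/mirko/pdfa/PDFA_def_UTL.ps
merged_pdf=pdf/combined-pdf.pdf

if require_readable "$pdfa_def" "$merged_pdf"; then
    echo "inputs present, Ghostscript can be run"
else
    echo "fix the inputs listed above before running Ghostscript"
fi
```

This only checks the two files named on the command line. The ICC profile referenced from inside the definition file is not checked here, because its location is not documented in this procedure.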

Delete the temporary pdf subdirectory, including pdf/combined-pdf.pdf.
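
Deleting the temporary files is only safe once the final PDF has actually been written. As a hedged sketch, the cleanup could be guarded so that nothing is removed when downsampled-pdf.pdf is missing or empty. The function name cleanup_staging is hypothetical and not part of the original procedure.

```bash
#!/usr/bin/env bash
# Remove the temporary directory only if the final PDF exists and is non-empty.
# cleanup_staging is a hypothetical helper introduced for this sketch.
cleanup_staging() {
    local final_pdf=$1 tmp_dir=$2
    if [ -s "$final_pdf" ]; then
        rm -rf -- "$tmp_dir"
        echo "removed $tmp_dir"
    else
        echo "kept $tmp_dir: $final_pdf is missing or empty" >&2
        return 1
    fi
}
```

From the staging directory it would be called as: cleanup_staging downsampled-pdf.pdf pdf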

Version	Old Version 11	New Version 12
Changes made by	MM Hanke	MM Hanke
Saved on	Oct 11, 2024	Oct 11, 2024
