Prerequisites

Software

Make sure Tesseract, MuPDF, and Ghostscript are up to date. Run the following command to update the software components of your Linux installation:

...
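As a quick sanity check before or after updating, the sketch below reports which of the required command-line tools cannot be found on the PATH. The function name `missing_tools` is a hypothetical helper, not part of any of the tools; the binary names `tesseract`, `mutool`, and `gs` are taken from the commands used later on this page.

```bash
# Print the name of every listed tool that is not found on the PATH.
# Prints nothing when all tools are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Calling `missing_tools tesseract mutool gs` should print nothing when all three tools are installed. It only checks that the tools are present, not that they are current.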

Warning

This procedure does not yet work as described: instructions for where to place the PDF/A definition file and the ICC profile are still missing. See https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#creating-a-pdf-a-document

The path to the PS file in step 4 (Ghostscript) must be valid on the computer from which you run the command. That PS file must also be edited so that it contains a valid path to an ICC profile.
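One way to verify the second requirement is sketched below. It assumes the PS file declares the profile on a line of the form `/ICCProfile (path/to/profile.icc)`, as the sample PDFA_def.ps shipped with Ghostscript does; if your file is laid out differently, the pattern needs adjusting. The function names `icc_profile_path` and `check_icc_profile` are hypothetical helpers.

```bash
# Print the ICC profile path declared in a PDF/A definition (PS) file.
# Assumption: the file contains a line like  /ICCProfile (some/path.icc)
icc_profile_path() {
    sed -n 's/^[[:space:]]*\/ICCProfile[[:space:]]*(\([^)]*\)).*/\1/p' "$1" | head -n 1
}

# Succeed only if a profile path is declared and that file is readable.
# A relative path is checked against the current working directory.
check_icc_profile() {
    icc="$(icc_profile_path "$1")"
    [ -n "$icc" ] && [ -r "$icc" ]
}
```

For example, `check_icc_profile /path/to/your_def.ps || echo "ICC profile not found"` would flag a definition file whose profile path does not resolve on the current machine.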

  1. Navigate to the staging directory containing the Production Master (_pm) image files

  2. Create the temporary subdirectories 'pdf' and 'pdf/pages' inside the staging directory:

    mkdir -p pdf/pages

    Run Tesseract OCR on all images in the staging directory, writing one PDF per page to 'pdf/pages':

    # Quote the file names so that names containing spaces are handled correctly.
    for i in *.tif; do tesseract -c tessedit_page_number=0 -l eng "$i" "pdf/pages/${i%_pm.tif}" pdf; done
  3. Merge page-level PDF documents into one PDF per volume/issue:

    mutool merge -o combined-pdf.pdf pdf/pages/*.pdf
  4. Produce the final PDF/A document with Ghostscript, downsampling color images to 150 dpi:

    gs -sDEVICE=pdfwrite \
        -dPDFA=2 \
        -dPDFACompatibilityPolicy=1 \
        -dNOSAFER \
        -dFastWebView \
        -sColorConversionStrategy=RGB \
        -dDownsampleColorImages=true \
        -dColorImageDownsampleThreshold=1.0 \
        -dAutoRotatePages=/None \
        -dColorImageResolution=150 \
        -o downsampled-pdf.pdf \
        /mnt/dps/staff_workspaces/mirko/pdfa/PDFA_def_UTL.ps \
        combined-pdf.pdf
  5. Delete the temporary 'pdf' subdirectory and combined-pdf.pdf; downsampled-pdf.pdf is the final output.
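The steps above could be combined into a single function, sketched here under a few assumptions: the merged file is written to and read from combined-pdf.pdf in the staging directory, the path to the PDF/A definition file is passed as an absolute path, and any failing step should stop the run before cleanup. The names `page_base` and `build_pdfa` are hypothetical helpers; the tool invocations and flags are the ones listed in the steps.

```bash
# Strip the _pm.tif suffix to get the per-page output base name.
page_base() {
    printf '%s\n' "${1%_pm.tif}"
}

# Run steps 2 to 5 inside one staging directory.
# $1: staging directory; $2: absolute path to the PDF/A definition (PS) file.
# The body runs in a subshell, so the caller's working directory is unchanged.
build_pdfa() (
    cd "$1" || exit 1
    mkdir -p pdf/pages || exit 1
    for i in *.tif; do
        # An unmatched glob stays literal, so stop if there are no images.
        [ -e "$i" ] || { echo "no .tif files in $1" >&2; exit 1; }
        tesseract -c tessedit_page_number=0 -l eng \
            "$i" "pdf/pages/$(page_base "$i")" pdf || exit 1
    done
    mutool merge -o combined-pdf.pdf pdf/pages/*.pdf || exit 1
    gs -sDEVICE=pdfwrite -dPDFA=2 -dPDFACompatibilityPolicy=1 \
        -dNOSAFER -dFastWebView -sColorConversionStrategy=RGB \
        -dDownsampleColorImages=true -dColorImageDownsampleThreshold=1.0 \
        -dAutoRotatePages=/None -dColorImageResolution=150 \
        -o downsampled-pdf.pdf "$2" combined-pdf.pdf || exit 1
    # Clean up only after every step has succeeded.
    rm -r pdf combined-pdf.pdf
)
```

A call such as `build_pdfa /path/to/staging /path/to/PDFA_def.ps` (both paths are placeholders) would leave downsampled-pdf.pdf in the staging directory. Because cleanup runs last, the intermediate files remain available for inspection if any step fails.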