Prerequisites

Software

Make sure that Tesseract, MuPDF and Ghostscript are up to date. On a Debian- or Ubuntu-based Linux installation, run the following command to update the installed software packages:

sudo apt update && sudo apt upgrade
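
Before starting, it can help to confirm that all three tools are installed and on the PATH. The following is a minimal sketch: the function name check_tools is hypothetical, and the binary names tesseract, mutool and gs are the ones used in the commands on this page.

```shell
# Hypothetical helper: report whether each named tool is on the PATH.
# Returns a non-zero status if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools tesseract mutool gs` prints one line per tool and returns a non-zero status if any of them is missing.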

Input files and staging space

This process presumes that the image files used as OCR input are cropped and quality-checked (QC'd) TIF files, named in sequential order. Avoid using this method on JPEG files, as this could result in significant loss of image quality.

Page images need to be staged in one subdirectory per volume/issue.

Ensure that the staging area used to process the images has enough space available. The process creates intermediate files (page-level PDFs and an uncompressed, unoptimized aggregate PDF document), which can take up hundreds of megabytes.
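
One way to check the available space before starting is sketched below. The function name has_free_space is hypothetical, and the threshold is whatever the operator considers sufficient; this page does not prescribe a number. The sketch assumes the POSIX output format of `df -P`, where the fourth field of the second line is the available space.

```shell
# Hypothetical helper: succeed if the filesystem holding directory $1
# has at least $2 KiB of free space.
has_free_space() {
    dir=$1
    required_kb=$2
    # df -Pk prints a header line, then one line per filesystem;
    # field 4 is the available space in 1024-byte blocks.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$available_kb" -ge "$required_kb" ]
}
```

For example, `has_free_space . 1048576 || echo "less than 1 GiB free"` run from the staging directory would warn when less than 1 GiB is available. The 1 GiB figure is only an illustration.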

Processing instructions

Note: this procedure does not yet work as described. Instructions for where to place the PDF/A definition file and the ICC profile files are still missing. See: https://ghostscript.readthedocs.io/en/latest/VectorDevices.html#creating-a-pdf-a-document

  1. Navigate to the staging directory containing the Production Master (_pm) image files

  2. Create the temporary subdirectories 'pdf' and 'pdf/pages' inside the staging directory:

    mkdir -p pdf/pages

  3. Run Tesseract OCR on all images in the staging directory:

    for i in *.tif; do tesseract -c tessedit_page_number=0 -l eng "$i" "pdf/pages/${i%_pm.tif}" pdf; done

  4. Merge the page-level PDF documents into one PDF per volume/issue:

    mutool merge -o pdf/combined-pdf.pdf pdf/pages/*.pdf

  5. Produce the final PDF document:

    gs -sDEVICE=pdfwrite \
        -dPDFA=2 \
        -dPDFACompatibilityPolicy=1 \
        -dNOSAFER \
        -dFastWebView \
        -sColorConversionStrategy=RGB \
        -dDownsampleColorImages=true \
        -dColorImageDownsampleThreshold=1.0 \
        -dAutoRotatePages=/None \
        -dColorImageResolution=150 \
        -o downsampled-pdf.pdf \
        /mnt/dps/staff_workspaces/mirko/pdfa/PDFA_def_UTL.ps pdf/combined-pdf.pdf

  6. Delete the pdf subfolder, which contains the page-level PDFs and combined-pdf.pdf.
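
Before deleting the intermediate files, it may be worth confirming that the final PDF has one page per source image. The following is a rough sketch, not part of the documented procedure: the function names are hypothetical, and it assumes that `mutool info` prints a line of the form `Pages: N`, which should be confirmed against the installed MuPDF version.

```shell
# Hypothetical helper: print the page count of PDF $1,
# parsed from the "Pages: N" line of `mutool info`.
pdf_page_count() {
    mutool info "$1" | awk '/^Pages:/ { print $2; exit }'
}

# Hypothetical helper: print the number of .tif files directly in directory $1.
tif_count() {
    count=0
    for f in "$1"/*.tif; do
        # Skip the unexpanded pattern when no .tif files exist.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Succeed only if PDF $2 has exactly one page per TIF in directory $1.
pages_match() {
    [ "$(tif_count "$1")" -eq "$(pdf_page_count "$2")" ]
}
```

Run from the staging directory, `pages_match . downsampled-pdf.pdf && rm -r pdf` would remove the intermediate files only when the counts agree.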
