Batch ingest for paged resources is different enough that it has its own dedicated space. It is meant to supplement the more comprehensive batch ingest instructions found here.


Note

When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:

  • Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
  • Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
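
The two preparation rules above can be checked before staging. The following is a minimal sketch, not part of the DAMS software: `check_pdf_staging` and the `PAGE_KEY` pattern are hypothetical names, and the sketch assumes you have already read your manifest into a dictionary of argument names and filenames.

```python
import re

# Assumed pattern for page-image arguments such as PAGE001 (hypothetical helper).
PAGE_KEY = re.compile(r"^PAGE\d+$")


def check_pdf_staging(entries):
    """Return problems that would stop a staged PDF from being split into pages.

    entries maps manifest arguments to filenames, e.g. {"PDF": "report.pdf"}.
    """
    problems = []
    pdf_name = entries.get("PDF")
    if pdf_name is None:
        return problems  # no PDF staged, nothing to check
    if any(ch.isspace() for ch in pdf_name):
        problems.append(f"PDF filename contains spaces: {pdf_name!r}")
    page_keys = sorted(key for key in entries if PAGE_KEY.match(key))
    if page_keys:
        problems.append("page images staged with the PDF: " + ", ".join(page_keys))
    return problems
```

An empty list means neither rule is violated; it does not guarantee that the ingest itself will succeed.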
Info
Batch Constraints
  • If you provide your own PDF datastream, there is no hard upper limit on the number of pages, though for now we recommend staying under 500 pages per batch submission.
  • If you are not supplying your own PDF datastream, please keep the number of pages to 10 or fewer. If you need to ingest a larger object, either provide your own PDF or split your resource into multiple batch submissions.
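
As a rough illustration of the two limits above, the sketch below returns advice for a planned submission. The function and constant names are hypothetical; the numbers come from the constraints listed here.

```python
RECOMMENDED_MAX_PAGES_WITH_PDF = 500  # recommendation, not a hard limit
MAX_PAGES_WITHOUT_PDF = 10


def batch_size_advice(page_count, has_own_pdf):
    """Return None if the batch is within the guidance, else a short message."""
    if has_own_pdf:
        if page_count >= RECOMMENDED_MAX_PAGES_WITH_PDF:
            return "consider splitting: 500 or more pages in one submission"
        return None
    if page_count > MAX_PAGES_WITHOUT_PDF:
        return "supply your own PDF or split the resource into multiple submissions"
    return None
```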

Directory Structure for Batch Ingest of Paged Content

...

| Argument | Value Associated | Purpose | Accepted Data Type | Additional Notes |
|---|---|---|---|---|
| MODS | name of your MODS file | provide MODS for your resource | xml | |
| TN | thumbnail picture | provide thumbnail picture for your resource | png, jpg, jpeg | |
| LANG | one of the supported language codes here | to have OCR created for each page | text | |
| PAGE + NUMBER | name of the file with the page object | actual page content | tiff, tif, jp2, jpg, jpeg | |
| PAGE + OCR_CUSTOM | name of your OCR file for that page | allows you to provide your own OCR | text file | |
| FULL_TEXT_CUSTOM | name of your full text for your PDF file | allows you to provide your own FULL_TEXT for the PDF | text file | |
| PDF | name of your PDF file | PDF for the resource; it will be cut up into individual pages if individual pages are not provided | pdf file | Do not provide PDFs at the page level. Only provide them at the book/issue level. If you do provide them, we cannot guarantee they will not be overwritten by inferior-quality system-generated PDFs. |
| HOSTPUBLICATION | pid | add issue(s) to a publication | text | |
| HOSTISSUE | pid | add pages to an issue | text | |
| HOSTBOOK | pid | add pages to a book | text | |
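
To show how the arguments and accepted data types fit together, here is a minimal sketch that reads `ARGUMENT==value` lines and flags filenames whose extension does not match the table. It is an illustration only: `parse_manifest`, `wrong_extensions`, and the `PAGE\d+` pattern are hypothetical, and only the rows with explicit file extensions are checked.

```python
import re
from pathlib import Path

# Extensions copied from the "Accepted Data Type" column of the table.
ACCEPTED_EXTENSIONS = {
    "MODS": {".xml"},
    "TN": {".png", ".jpg", ".jpeg"},
    "PAGE": {".tiff", ".tif", ".jp2", ".jpg", ".jpeg"},
    "PDF": {".pdf"},
}
PAGE_KEY = re.compile(r"^PAGE\d+$")  # assumed form of PAGE + NUMBER, e.g. PAGE001


def parse_manifest(text):
    """Parse manifest text into a dict of argument -> value, skipping blank lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("==")
        if not sep:
            raise ValueError(f"not an ARGUMENT==value line: {line!r}")
        entries[key.strip()] = value.strip()
    return entries


def wrong_extensions(entries):
    """Return the arguments whose filename extension is not accepted."""
    bad = []
    for key, value in entries.items():
        kind = "PAGE" if PAGE_KEY.match(key) else key
        allowed = ACCEPTED_EXTENSIONS.get(kind)
        if allowed is not None and Path(value).suffix.lower() not in allowed:
            bad.append(key)
    return bad
```

Arguments that are not listed in `ACCEPTED_EXTENSIONS` (for example LANG or the HOST arguments) are passed through without an extension check.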

...

Sample manifest (datastreams.txt) file for a book and/or publication:


MODS==modsfilename.xml

TN==thumbnailfilename.jpg

OTHERDATASTREAM==filename.ext

Adding Custom Datastreams

You can provide custom datastreams with your own naming convention. Please do not use any of the Restricted Datastream IDs, and add the suffix '_CUSTOM' to your Datastream ID.
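
The naming rule for custom datastreams can be sketched as a small check. The list of Restricted Datastream IDs is not reproduced on this page, so the sketch takes it as a parameter; `is_valid_custom_id` is a hypothetical helper, not a DAMS function.

```python
CUSTOM_SUFFIX = "_CUSTOM"


def is_valid_custom_id(datastream_id, restricted_ids):
    """Check that an ID ends in _CUSTOM, has a name before the suffix,
    and is not one of the restricted IDs passed in by the caller."""
    if datastream_id in restricted_ids:
        return False
    if not datastream_id.endswith(CUSTOM_SUFFIX):
        return False
    return len(datastream_id) > len(CUSTOM_SUFFIX)
```

For example, `is_valid_custom_id("TRANSCRIPT_CUSTOM", restricted_ids)` passes as long as that ID is not in the restricted set you supply.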

MODS System Generated Fields 

  • keydate: when batch ingesting paged content, add the keyGen="yes" attribute to the appropriate MODS element (dateCreated or dateIssued) to indicate which date is the keydate. Without it, issues will appear to be missing from your Publication. To find issues with missing keydates, go to the publication > Manage > Publication. Fix the problem by editing the MODS for the affected issue.
  • recordCreationDate
  • identifier utldamspid, utldamsuri, filename
  • relatedItem UTLDAMS digital collection and subelements
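
Before submitting a batch, you can check whether a MODS record marks a key date. The sketch below uses Python's standard `xml.etree.ElementTree` and assumes the record uses the MODS v3 namespace. `has_key_date` is a hypothetical helper. The attribute name is a parameter that defaults to the keyGen spelling used on this page; the MODS schema itself names its key-date attribute keyDate, so confirm which one your records should carry.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DATE_ELEMENTS = ("dateCreated", "dateIssued")


def has_key_date(mods_xml, attribute="keyGen"):
    """Return True if any dateCreated or dateIssued element has attribute="yes"."""
    root = ET.fromstring(mods_xml)
    for name in DATE_ELEMENTS:
        for element in root.iter(f"{{{MODS_NS}}}{name}"):
            if element.get(attribute) == "yes":
                return True
    return False
```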

OCR Selection 

During a batch ingest of Paged Content, OCR runs only if the specified OCR language is supported. For further documentation about text extraction see here. Set the language to its 3-character language code. Currently supported languages and their codes are listed here. If a language is specified and validated as supported, OCR will be run on the child pages. To specify a language, add the following line to the manifest:


LANG==[language code]

Info

If a language other than one of the enabled ones mentioned above is added to the manifest, the object will still be ingested but OCR extraction will not occur.
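
The three possible outcomes of the LANG line can be summarized in a short sketch. The supported codes are listed elsewhere, so they are passed in as a parameter; `ocr_status` is a hypothetical helper that only describes the documented behavior and does not call the DAMS.

```python
def ocr_status(entries, supported_languages):
    """Describe what the LANG line of a parsed manifest will lead to.

    entries maps manifest arguments to values; supported_languages is the
    set of codes you have confirmed as enabled.
    """
    language = entries.get("LANG")
    if language is None:
        return "no LANG line: OCR not requested"
    if language in supported_languages:
        return f"OCR will run on child pages with language {language}"
    return f"{language} is not supported: object is ingested without OCR"
```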

...

Warning

The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book.

Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers.
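
One way to catch overlapping page numbers before submitting is to compare the page arguments in your manifest against the page numbers already in the book. This is a minimal sketch with hypothetical names; it assumes page arguments look like PAGE001 and that you already have the list of existing page numbers from the book.

```python
import re

PAGE_KEY = re.compile(r"^PAGE(\d+)$")  # matches PAGE001 but not PAGE001_MODS


def overlapping_page_numbers(manifest_keys, existing_page_numbers):
    """Return the sorted page numbers that the manifest would duplicate."""
    new_numbers = set()
    for key in manifest_keys:
        match = PAGE_KEY.match(key)
        if match:
            new_numbers.add(int(match.group(1)))
    return sorted(new_numbers & set(existing_page_numbers))
```

An empty result means none of the new page arguments reuse a number from the list you supplied.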

  1. Place the appropriate argument from the following list in your manifest. After the ==, give the book/publication PID without the namespace (e.g. utblac, utlarch).
    • HOSTPUBLICATION
    • HOSTBOOK
    • HOSTISSUE
  2. When you submit the batch request, the PID of the sub-collection is ignored.

Sample manifest (datastreams.txt) file when ingesting page(s) and/or issue(s).


HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449

PAGE001==firstpagefilename.tiff

PAGE002==secondpagefilename.tiff

PAGE001_MODS==firstpagefilename.xml

PAGE002_MODS==secondpagefilename.xml

PAGE001_OCR_CUSTOM==firstpageocrfilename.txt (note: must be a text file to be indexed properly)

PAGE002_OCR_CUSTOM==secondpageocrfilename.txt

FULL_TEXT_CUSTOM==yourcustomocr.txt

PDF==pdffilename.pdf (optional)

PAGE001_ALTO==altooutput.xml (optional)
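
A manifest like the sample above could be generated rather than typed by hand. The sketch below is an illustration under stated assumptions: `build_pages_manifest` is a hypothetical helper, page numbers are zero-padded to three digits as in the sample, and a PID with a namespace is assumed to have the form namespace:id.

```python
HOST_ARGUMENTS = {"HOSTPUBLICATION", "HOSTBOOK", "HOSTISSUE"}


def build_pages_manifest(host_argument, host_pid, pages):
    """Build manifest text for adding pages to an existing object.

    pages maps page number (int) -> page image filename.
    """
    if host_argument not in HOST_ARGUMENTS:
        raise ValueError(f"unknown host argument: {host_argument}")
    # Assumption: a namespaced PID looks like "utlarch:<id>"; keep only the id.
    bare_pid = host_pid.split(":", 1)[-1]
    lines = [f"{host_argument}=={bare_pid}"]
    for number in sorted(pages):
        lines.append(f"PAGE{number:03d}=={pages[number]}")
    return "\n".join(lines) + "\n"
```

Calling it with `"HOSTBOOK"`, a PID, and `{1: "firstpagefilename.tiff", 2: "secondpagefilename.tiff"}` yields the host line followed by one PAGE line per image, in page order.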


3. When adding pages, give the directory that contains the pages the suffix '_PAGES'.

Sample Directory Structure
  • eid1234_example-batch-adding-pages (Top level directory)
    • my_pages_to_add_PAGES
      • firstpagefilename.tiff
      • secondpagefilename.tiff
      • firstpagefilename.xml
      • secondpagefilename.xml
      • firstpageocrfilename.txt
      • secondpageocrfilename.txt
      • yourcustomocr.txt
      • pdffilename.pdf
      • altooutput.xml
      • datastreams.txt
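
The directory layout above can be sanity-checked before submission. This sketch is hypothetical and works on plain names rather than the filesystem: you pass the directory name, the filenames inside it, and the text of datastreams.txt. It treats LANG and the HOST arguments as text values rather than filenames, following the argument table on this page.

```python
# Arguments whose values are text (a language code or a PID), not filenames.
NON_FILE_ARGUMENTS = {"LANG", "HOSTPUBLICATION", "HOSTISSUE", "HOSTBOOK"}


def check_pages_directory(dir_name, filenames, manifest_text):
    """Return a list of layout problems for a directory of pages to add."""
    problems = []
    if not dir_name.endswith("_PAGES"):
        problems.append("directory name does not end with _PAGES")
    present = set(filenames)
    if "datastreams.txt" not in present:
        problems.append("datastreams.txt is missing")
    for line in manifest_text.splitlines():
        key, sep, value = line.strip().partition("==")
        if not sep or key.strip() in NON_FILE_ARGUMENTS:
            continue
        if value.strip() not in present:
            problems.append(f"{key.strip()} points to a missing file: {value.strip()}")
    return problems
```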