General information for batch ingest

The batch ingest process runs continuously, looking for newly queued batch jobs approximately every 5 minutes. You can add batch ingest jobs to the queue at any time.

Batch jobs are subject to the following batch job size and file size limitations:

max. 100GB/batch job
max. 10GB/file
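
The limits above can be checked locally before a job is queued. The following is a minimal sketch, not part of the DAMS software: the function name `check_batch_size` is hypothetical, and it assumes 1 GB means 10**9 bytes, since the limits above do not say whether decimal or binary gigabytes are meant.

```python
import os

# Assumption: 1 GB is treated as 10**9 bytes; the documented limits do not
# say whether decimal or binary gigabytes are meant.
MAX_FILE_BYTES = 10 * 10**9
MAX_BATCH_BYTES = 100 * 10**9


def check_batch_size(batch_dir, max_file=MAX_FILE_BYTES, max_batch=MAX_BATCH_BYTES):
    """Return a list of problems found; an empty list means the batch fits."""
    problems = []
    total = 0
    for root, _dirs, files in os.walk(batch_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            if size > max_file:
                problems.append(f"file too large: {path} ({size} bytes)")
    if total > max_batch:
        problems.append(f"batch too large: {total} bytes")
    return problems
```

Calling `check_batch_size("eid1234_example-batch-submission")` on a staged batch folder returns an empty list when both limits are respected.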

Step 1: Stage files for batch ingest job

Organise files in a batch job folder, using subfolders if appropriate. Refer to the instructions/options listed below for preparing batch jobs.

When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:

Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
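
The two PDF staging rules above lend themselves to a quick local check. This is a rough sketch under stated assumptions: `check_pdf_staging` is a hypothetical helper, and the set of page image extensions is borrowed from the accepted data types for PAGE entries in the manifest table. A jpg thumbnail staged as a derivative would also be flagged, so treat the result as a hint rather than a verdict.

```python
from pathlib import Path

# Assumption: page images are recognised by these extensions, taken from the
# accepted data types for PAGE entries in the manifest arguments table.
PAGE_IMAGE_EXTENSIONS = {".tiff", ".tif", ".jp2", ".jpg", ".jpeg"}


def check_pdf_staging(asset_dir):
    """Return warnings for a folder staged for PDF-to-pages splitting."""
    files = [p for p in Path(asset_dir).iterdir() if p.is_file()]
    pdfs = [p for p in files if p.suffix.lower() == ".pdf"]
    images = [p for p in files if p.suffix.lower() in PAGE_IMAGE_EXTENSIONS]
    warnings = []
    for pdf in pdfs:
        if " " in pdf.name:
            warnings.append(f"PDF filename contains spaces: {pdf.name}")
    if pdfs and images:
        warnings.append("image files staged alongside a PDF; the PDF may not be split")
    return warnings
```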

Directory Structure for Batch Ingest of Paged Content

For batch ingest, your assets need to be organized under a top-level directory, which you will reference in the batch submission form in the DAMS GUI. That directory holds one or more directories containing your asset(s), each named with a suffix denoting its resource type, and each containing a manifest (datastreams.txt) and the content.

The batch ingest process expects the assets to be staged in the folder structure described here. For instance, every folder ending in _BOOK, _PUBLICATION, _ISSUE, or _PAGES must be contained in a top-level directory for the batch job. The batch process will not complete otherwise.

Example Batch Submission

eid1234_example-batch-submission (Top level directory)
- grapes_of_wrath_BOOK
  - datastreams.txt
  - modsfile.xml
  - page1.tiff
  - page1mods.xml
- wall_street_journal_PUBLICATION
  - datastreams.txt*
  - modsfile.xml
  - jan_2016_ISSUE
    - datastreams.txt
    - page1.tiff
    - page1_mods.xml
  - feb_2016_ISSUE
    - datastreams.txt
    - page1.tiff
    - page1_mods.xml
- new_york_times_PAGES
  - datastreams.txt
  - page1.tiff
  - page1_mods.xml

*Note that if you are not including MODS at the publication level, you still need a blank manifest at this level.

Suffixes

Note: All directories need to have one of the following suffixes to indicate to the system what that directory content is for.

_BOOK
_PAGES
_ISSUE
_PUBLICATION
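
The suffix rule can be checked before submission. The sketch below is illustrative only (`find_unsuffixed_dirs` is a hypothetical name, not a DAMS tool); it walks every directory beneath the top-level batch directory and reports those whose names do not end in one of the four suffixes listed above.

```python
from pathlib import Path

ALLOWED_SUFFIXES = ("_BOOK", "_PAGES", "_ISSUE", "_PUBLICATION")


def find_unsuffixed_dirs(top_level_dir):
    """Return directories under top_level_dir whose names lack an allowed suffix."""
    top = Path(top_level_dir)
    bad = []
    for path in sorted(top.rglob("*")):
        # str.endswith accepts a tuple and matches any of its members.
        if path.is_dir() and not path.name.endswith(ALLOWED_SUFFIXES):
            bad.append(str(path.relative_to(top)))
    return bad
```

The top-level directory itself is not checked, since it carries the batch name rather than a resource type.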

Guidelines for Manifest

The manifest, a file named 'datastreams.txt', defines what will be included in the resource you are ingesting.

Manifest Arguments

Helpful Hints

The number of zeros used to pad page numbers (for example, PAGE001) is up to you.
If you are adding your own OCR datastream, you will need to append "_CUSTOM" to the end of the argument (for example, PAGE001_OCR_CUSTOM).
The filename of other datastreams does not need to match the page filename.
If you are ingesting your own PDF along with pages, then set PDF to the pdf file name so that the system will not create an additional PDF datastream.

The available manifest arguments are:

Argument	Value Associated	Purpose	Accepted Data Type	Additional Notes
MODS	name of your mods file	provide MODS for your resource	xml
TN	thumbnail picture	provide thumbnail picture for your resource	png, jpg, jpeg
LANG	one of the supported language codes here	To have OCR created for each page	text
PAGE + NUMBER	name of the file with the page object	actual page content	tiff, tif, jp2, jpg, jpeg
PAGE + OCR_CUSTOM	name of your ocr file for that page	allows you to provide your own OCR	textfile
FULL_TEXT_CUSTOM	name of your full text for your pdf file	allows you to provide your own FULL_TEXT for the pdf	textfile
PDF	name of your pdf file	PDF for resource, will be cut up into individual pages if individual pages are not provided	pdf file	Do not provide PDFs at the page level. Only provide them at the book/issue level. If you do provide them we cannot guarantee they will not be overwritten by inferior quality system generated PDFs.
HOSTPUBLICATION	pid	Add issue(s) to publication	text
HOSTISSUE	pid	Add pages to an issue	text
HOSTBOOK	pid	Add pages to a book	text

Sample manifest (datastreams.txt) file for a book and/or publication:

MODS==modsfilename.xml

TN==thumbnailfilename.jpg

OTHERDATASTREAM==filename.ext
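
To illustrate the `ARGUMENT==value` layout of the manifest, here is a minimal parsing sketch. It is not the parser the DAMS software uses; `parse_manifest` is a hypothetical helper that assumes one entry per line, ignores blank lines, and treats everything after the first `==` as the value.

```python
def parse_manifest(text):
    """Parse 'KEY==value' lines into a dict, skipping blank lines."""
    entries = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("==")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"line {line_number}: expected KEY==value, got {raw!r}")
        entries[key.strip()] = value.strip()
    return entries
```

A blank manifest parses to an empty dictionary, and a line written with a single `=` raises `ValueError`, which makes that common typo easy to spot.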

Adding Custom Datastreams

You can provide custom datastreams with your own naming convention. Do not use any of the Restricted Datastream IDs, and add the suffix '_CUSTOM' to your Datastream ID.

MODS System Generated Fields

keydate: when batch ingesting paged content, users need to add the keyDate="yes" attribute to the appropriate MODS element (dateCreated or dateIssued) to indicate which date is the keydate. If no keydate is set, issues will appear to be missing from your Publication. To find issues with missing keydates, go to the publication > Manage > Publication. Fix the problem by editing the MODS for the problematic issue.
recordCreationDate
identifier utldamspid, utldamsuri, filename
relatedItem UTLDAMS digital collection and subelements

OCR Selection

During a batch ingest of paged content, OCR runs only if the specified OCR language is supported. For further documentation about text extraction see here. Set the language to its three-letter language code. Currently supported languages and their codes are listed here. If a language is specified and validated as supported, OCR will be run on the child pages. To specify a language, add the following line to the manifest:

LANG==[language code]

If a language other than one of the enabled ones mentioned above is added to the manifest, the object will still be ingested but OCR extraction will not occur.
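
The OCR decision described above can be sketched as a small check. The helper name `ocr_will_run` is hypothetical, and the set of supported codes must be supplied by the caller from the DAMS list of enabled languages; this document does not enumerate them. The sketch also assumes the code must match exactly, including case.

```python
import re


def ocr_will_run(manifest_text, supported_codes):
    """Return True if the manifest has a LANG line naming a supported code."""
    match = re.search(r"^\s*LANG==(\S+)\s*$", manifest_text, flags=re.MULTILINE)
    if match is None:
        return False  # no LANG line: no OCR is requested
    # Assumption: codes are compared exactly, including letter case.
    return match.group(1) in supported_codes
```

For example, `ocr_will_run(text, {"eng", "fra"})` uses placeholder codes; substitute the codes actually enabled in your DAMS instance.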

Adding to an Existing Book or Issue

The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book.

Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers.
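
One way to catch overlapping page numbers before ingest is to compare the PAGE arguments in the new manifest against the page numbers already present in the book. The sketch below is illustrative: `overlapping_page_numbers` is a hypothetical helper, the list of existing page numbers has to be gathered by hand from the DAMS, and it assumes zero padding is not significant (PAGE001 and PAGE1 both mean page 1).

```python
import re

# Matches PAGE001 but not PAGE001_MODS or PAGE001_OCR_CUSTOM.
PAGE_KEY = re.compile(r"^PAGE(\d+)$")


def overlapping_page_numbers(manifest_keys, existing_page_numbers):
    """Return sorted page numbers in the manifest that the book already has."""
    new_numbers = set()
    for key in manifest_keys:
        match = PAGE_KEY.match(key)
        if match:
            # Assumption: padding is not significant, so PAGE001 is page 1.
            new_numbers.add(int(match.group(1)))
    return sorted(new_numbers & set(existing_page_numbers))
```

An empty result means none of the newly assigned page numbers collide with existing ones.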

Place the appropriate argument from the following list in your manifest. The value after the == is the PID of the host book, publication, or issue, without the namespace (e.g. utblac, utlarch).
- HOSTPUBLICATION
- HOSTBOOK
- HOSTISSUE
When you submit the batch request, the PID of the sub-collection is ignored.

Sample manifest (datastreams.txt) file when ingesting page(s) and/or issue(s).

HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449

PAGE001==firstpagefilename.tiff

PAGE002==secondpagefilename.tiff

PAGE001_MODS==firstpagefilename.xml

PAGE002_MODS==secondpagefilename.xml

PAGE001_OCR_CUSTOM==firstpageocrfilename.txt (note: must be text file to get properly indexed)

PAGE002_OCR_CUSTOM==secondpageocrfilename.txt

FULL_TEXT_CUSTOM==yourcustomocr.txt

PDF==pdffilename.pdf (optional)

PAGE001_ALTO==altooutput.xml (optional)

When adding pages, the directory containing the pages to add should have the suffix '_PAGES'.

Sample Directory Structure

eid1234_example-batch-adding-pages (Top level directory)
- my_pages_to_add_PAGES
  - firstpagefilename.tiff
  - secondpagefilename.tiff
  - firstpagefilename.xml
  - secondpagefilename.xml
  - firstpageocrfilename.txt
  - secondpageocrfilename.txt
  - yourcustomocr.txt
  - pdffilename.pdf
  - altooutput.xml
  - datastreams.txt
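
Since every file named in the manifest has to be staged in the same folder, a cross-check between datastreams.txt and the folder contents can catch typos early. This is a sketch under assumptions: `missing_manifest_files` is a hypothetical helper, values are taken to be bare filenames (without annotations such as the "(optional)" notes in the sample manifest), and HOST* and LANG values are treated as identifiers rather than filenames.

```python
from pathlib import Path


def missing_manifest_files(asset_dir):
    """Return filenames named in datastreams.txt that are absent from asset_dir."""
    folder = Path(asset_dir)
    missing = []
    manifest = (folder / "datastreams.txt").read_text(encoding="utf-8")
    for raw in manifest.splitlines():
        key, sep, value = raw.strip().partition("==")
        # Assumption: HOST* and LANG values are identifiers, not filenames.
        if not sep or key.startswith("HOST") or key == "LANG":
            continue
        filename = value.strip()
        if not (folder / filename).is_file():
            missing.append(filename)
    return missing
```

Running it on a folder such as my_pages_to_add_PAGES returns an empty list when every referenced file is present.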
