What this does
The batch ingest process for complex assets, or paged content, allows you to batch-create DAMS assets that consist of component parts, typically pages of a book/serial issue/archival file/etc. See pages Anatomy of DAMS digital assets and Content models for details on the structure of digital assets.
Assets created with this batch ingest method will consist of a set of page-level assets, a book/publication issue-level asset, and optionally a publication (series)-level asset. You can also use this method to add publication issues to an existing publication series, and you can add pages to existing book-level or issue-level assets.
Typically, a book-level asset and its pages are created from image files produced by a scanning process (digitally reformatted content). This batch ingest method also lets you use a PDF document as the source file; the DAMS software then automatically creates an image file for each page of the submitted PDF. For digitally reformatted (scanned) content, using a PDF is strongly discouraged, because the automatically created page images are almost invariably of lower quality than the original scan images. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).
For born-digital content, like modern PDF ebooks, other content models and ingest processes might be more appropriate. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).
General information for batch ingest
The batch ingest process runs continuously, looking for newly queued batch jobs approximately every 5 minutes. You can add batch ingest jobs to the queue at any time.
Batch jobs are subject to the following batch job size and file size limitations:
- max. 100GB/batch job
- max. 10GB/file
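The limits above can be checked before a job is queued. The following is a minimal sketch, not part of the DAMS software: the function name `check_batch_size` is hypothetical, and it assumes "GB" means 10**9 bytes, which the limits above do not specify.

```python
import os

# Assumption: "GB" is read as 10**9 bytes; the stated limits do not say GB vs. GiB.
MAX_JOB_BYTES = 100 * 10**9
MAX_FILE_BYTES = 10 * 10**9


def check_batch_size(batch_dir, max_job=MAX_JOB_BYTES, max_file=MAX_FILE_BYTES):
    """Return (total_bytes, problems) for all files below batch_dir."""
    total = 0
    problems = []
    for root, _dirs, files in os.walk(batch_dir):
        for name in files:
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            if size > max_file:
                problems.append(f"file too large: {path} ({size} bytes)")
    if total > max_job:
        problems.append(f"batch job too large: {total} bytes")
    return total, problems
```

Calling `check_batch_size("eid1234_example-batch-submission")` returns the total size and a list of messages; an empty list means the job is within both limits under the stated assumption.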
Step 1: Stage files for batch ingest job
Organise files in a batch job folder, using subfolders if appropriate. Refer to the instructions/options listed below for preparing batch jobs.
When you ingest PDF files as paged content and want the document split into individual pages, prepare your source files as follows:
- Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
- Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
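The two rules above can be checked mechanically before staging. This is a rough sketch only: `check_pdf_staging` is a hypothetical helper, and the list of page image extensions is an assumption. Because a thumbnail may legitimately be a JPEG, filenames that are not page images can be passed in `ignore`.

```python
import os

# Assumption: files with these extensions are treated as page images.
PAGE_IMAGE_EXTS = {".tiff", ".tif", ".jp2", ".jpg", ".jpeg"}


def check_pdf_staging(folder, ignore=()):
    """Return warnings for a staged folder that contains a PDF source file."""
    names = sorted(
        n for n in os.listdir(folder) if os.path.isfile(os.path.join(folder, n))
    )
    pdfs = [n for n in names if n.lower().endswith(".pdf")]
    warnings = []
    for name in pdfs:
        if " " in name:
            warnings.append(f"PDF filename contains spaces: {name!r}")
    if pdfs:
        images = [
            n for n in names
            if os.path.splitext(n)[1].lower() in PAGE_IMAGE_EXTS and n not in ignore
        ]
        if images:
            warnings.append("page images staged with a PDF: " + ", ".join(images))
    return warnings
```

For example, `check_pdf_staging("my_book_BOOK", ignore=("thumbnail.jpg",))` returns an empty list when the folder holds only a space-free PDF, its metadata, and the named thumbnail.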
Staging files for batch ingest of paged content
For batch ingest, your assets MUST be organized within a top level directory that represents the batch job. You will reference the batch job directory in the batch submission form in the DAMS GUI. Each batch job directory MUST contain one or more subdirectories, each representing a book/publication issue, a publication/series, or a set of pages to be added to an existing paged content asset. You can combine multiple books, issues, and page addition sets in one batch job, as long as the job does not exceed the file and job size limitations.
Furthermore, publication issues can be nested within a subdirectory representing a publication/series, for instance if you want to create a publication/series and add publication issues at the same time. You can add further issues to the same publication with a separate batch job.
Subdirectories in the batch job folder MUST use one of the following suffixes to denote the content type or type of batch ingest:
<foldername>_PUBLICATION
Used to create a publication/series-level asset. These assets are intended to organize publication issues inside the DAMS. The DAMS GUI will provide a
<foldername>_ISSUE
<foldername>_BOOK
<foldername>_PAGES
A batch job consists of a top level directory, one or more directories with your asset(s) and the suffix denoting its content type, a manifest (datastreams.txt) and the content.
- eid1234_example-batch-submission (Top level directory)
  - grapes_of_wrath_BOOK
    - datastreams.txt
    - modsfile.xml
    - page1.tiff
    - page1mods.xml
  - wall_street_journal_PUBLICATION
    - datastreams.txt*
    - modsfile.xml
    - jan_2016_ISSUE
      - datastreams.txt
      - page1.tiff
      - page1_mods.xml
    - feb_2016_ISSUE
      - datastreams.txt
      - page1.tiff
      - page1_mods.xml
  - new_york_times_PAGES
    - datastreams.txt
    - page1.tiff
    - page1_mods.xml
*Note: even if you are not including MODS at the publication level, you still need a blank manifest at this level.
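The staging rules described above (suffixed subdirectories, a manifest in each, issues optionally nested in a publication) can be sketched as a pre-submission check. This is an illustrative helper, not the DAMS validator: `check_layout` is a hypothetical name, and it assumes that only `_ISSUE` directories may be nested inside a `_PUBLICATION` directory.

```python
import os

SUFFIXES = ("_PUBLICATION", "_ISSUE", "_BOOK", "_PAGES")
MANIFEST = "datastreams.txt"


def _subdirs(path):
    return sorted(e.name for e in os.scandir(path) if e.is_dir())


def _require_manifest(path, label, problems):
    if not os.path.isfile(os.path.join(path, MANIFEST)):
        problems.append(f"{label}: no {MANIFEST}")


def check_layout(batch_dir):
    """Return a list of layout problems found in a batch job directory."""
    problems = []
    top = _subdirs(batch_dir)
    if not top:
        problems.append("batch job directory contains no subdirectories")
    for name in top:
        path = os.path.join(batch_dir, name)
        if not name.endswith(SUFFIXES):
            problems.append(f"{name}: missing one of the suffixes {', '.join(SUFFIXES)}")
            continue
        _require_manifest(path, name, problems)
        if name.endswith("_PUBLICATION"):
            # Assumption: only issue directories are nested inside a publication.
            for child in _subdirs(path):
                label = f"{name}/{child}"
                if child.endswith("_ISSUE"):
                    _require_manifest(os.path.join(path, child), label, problems)
                else:
                    problems.append(f"{label}: expected an _ISSUE directory")
    return problems
```

An empty result means the folder matches the layout as understood here; it does not check the manifest contents or the files they reference.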
Guidelines for Manifest
The manifest, a file named 'datastreams.txt', defines what will be included in the resource you are ingesting.
Manifest Arguments
Helpful Hints
- The number of zeros used to pad page numbers (for example PAGE001) is up to you.
- If you are adding your own OCR datastream, append "_CUSTOM" to the end of the argument name (for example PAGE001_OCR_CUSTOM).
- The filename of other datastreams does not need to match the page filename.
- If you are ingesting your own PDF along with pages, then set PDF to the pdf file name so that the system will not create an additional PDF datastream.
The available manifest arguments are:
Argument | Value Associated | Purpose | Accepted Data Type | Additional Notes |
---|---|---|---|---|
MODS | name of your MODS file | provide MODS for your resource | xml | |
TN | thumbnail picture | provide thumbnail picture for your resource | png, jpg, jpeg | |
LANG | one of the supported language codes here | have OCR created for each page | text | |
PAGE + NUMBER | name of the file with the page object | actual page content | tiff, tif, jp2, jpg, jpeg | |
PAGE + NUMBER + _OCR_CUSTOM | name of your OCR file for that page | allows you to provide your own OCR | text file | |
FULL_TEXT_CUSTOM | name of your full text for your PDF file | allows you to provide your own FULL_TEXT for the PDF | text file | |
PDF | name of your PDF file | PDF for the resource; it will be cut up into individual pages if individual pages are not provided | pdf | Do not provide PDFs at the page level. Only provide them at the book/issue level. If you do provide them, we cannot guarantee they will not be overwritten by inferior-quality system-generated PDFs. |
HOSTPUBLICATION | pid | Add issue(s) to publication | text | |
HOSTISSUE | pid | Add pages to an issue | text | |
HOSTBOOK | pid | Add pages to a book | text | |
Sample manifest (datastreams.txt) file for a book and/or publication:
MODS==modsfilename.xml
TN==thumbnailfilename.jpg
OTHERDATASTREAM==filename.ext
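To make the `ARGUMENT==value` format concrete, here is a small sketch of how such a manifest could be read. It is not the DAMS parser: `parse_manifest` is a hypothetical function, and its tolerance for blank lines and surrounding whitespace, as well as its rejection of duplicate arguments, are assumptions.

```python
def parse_manifest(text):
    """Parse manifest text into a dict mapping argument name to value."""
    entries = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped (assumption)
        key, sep, value = line.partition("==")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"line {lineno}: expected ARGUMENT==value, got {raw!r}")
        if key in entries:
            raise ValueError(f"line {lineno}: duplicate argument {key}")
        entries[key] = value
    return entries


sample = "MODS==modsfilename.xml\nTN==thumbnailfilename.jpg\n"
print(parse_manifest(sample))
```

A blank manifest, such as the one required at the publication level when no MODS is supplied, parses to an empty dict.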
Adding Custom Datastreams
You can provide custom datastreams with your own naming convention. Do not use any of the Restricted Datastream IDs, and add the suffix '_CUSTOM' to your datastream ID.
MODS System Generated Fields
- keydate: when batch ingesting paged content, add the keyDate="yes" attribute to the appropriate MODS element (dateCreated or dateIssued) to indicate which date is the key date. Without a key date, issues will appear to be missing from your publication. To find issues with missing key dates, go to the publication > Manage > Publication. Fix the problem by editing the MODS for the affected issue.
- recordCreationDate
- identifier utldamspid, utldamsuri, filename
- relatedItem UTLDAMS digital collection and subelements
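Whether an issue's MODS record carries a key date can be checked before ingest. The sketch below uses Python's standard `xml.etree.ElementTree`; `key_dates` is a hypothetical helper. It assumes the record's root element is `mods` in the standard MODS v3 namespace, that the date sits in a top-level `originInfo`, and that the flag is the MODS schema's `keyDate` attribute (the attribute name is a parameter in case your records differ).

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
DATE_ELEMENTS = ("dateCreated", "dateIssued")


def key_dates(mods_xml, attr="keyDate"):
    """Return (element name, text) for each top-level date flagged as key date."""
    root = ET.fromstring(mods_xml)
    found = []
    for tag in DATE_ELEMENTS:
        # Only originInfo elements directly under the root are inspected.
        path = f"{{{MODS_NS}}}originInfo/{{{MODS_NS}}}{tag}"
        for element in root.findall(path):
            if element.get(attr) == "yes":
                found.append((tag, (element.text or "").strip()))
    return found
```

An empty result for an issue-level record suggests the issue would not show up under its publication until the MODS is edited.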
OCR Selection
During a batch ingest of paged content, OCR runs only if the specified OCR language is supported. For further documentation about text extraction see here. Set the language to its three-letter language code. Currently supported languages and their codes are listed here. If a language is specified and validated as supported, OCR will be run on the child pages. To specify a language, add the following line to the manifest:
LANG==[language code]
If a language other than one of the enabled ones mentioned above is added to the manifest, the object will still be ingested but OCR extraction will not occur.
Adding to an Existing Book or Issue
The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book.
Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers.
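A quick way to catch overlapping page numbers is to compare the PAGE arguments of the new manifest with the page numbers already present in the book. The sketch below is illustrative only: both function names are hypothetical, the list of existing page numbers has to be looked up by you in the DAMS, and it assumes that zero padding does not change the page number (PAGE001 and PAGE1 both mean page 1).

```python
import re

# Matches PAGE001 but not PAGE001_MODS or PAGE001_OCR_CUSTOM.
PAGE_KEY = re.compile(r"^PAGE(\d+)$")


def new_page_numbers(manifest_keys):
    """Extract the page numbers assigned by PAGE<number> manifest arguments."""
    numbers = []
    for key in manifest_keys:
        match = PAGE_KEY.match(key)
        if match:
            numbers.append(int(match.group(1)))
    return numbers


def overlapping_pages(manifest_keys, existing_numbers):
    """Return the sorted page numbers that the manifest would duplicate."""
    return sorted(set(new_page_numbers(manifest_keys)) & set(existing_numbers))
```

If `overlapping_pages(...)` returns anything other than an empty list, renumber the affected PAGE arguments before submitting the job.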
1. Place the appropriate argument from the following list in your manifest, followed by == and the PID of the host book, issue, or publication without the namespace (e.g. utblac, utlarch).
   - HOSTPUBLICATION
   - HOSTBOOK
   - HOSTISSUE
2. When you submit the batch request, the PID of the sub-collection is ignored.
Sample manifest (datastreams.txt) file when ingesting page(s) and/or issue(s).
HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449
PAGE001==firstpagefilename.tiff
PAGE002==secondpagefilename.tiff
PAGE001_MODS==firstpagefilename.xml
PAGE002_MODS==secondpagefilename.xml
PAGE001_OCR_CUSTOM==firstpageocrfilename.txt (note: must be text file to get properly indexed)
PAGE002_OCR_CUSTOM==secondpageocrfilename.txt
FULL_TEXT_CUSTOM==yourcustomocr.txt
PDF==pdffilename.pdf (optional)
PAGE001_ALTO==altooutput.xml (optional)
3. When adding pages, the directory containing the pages to add should have the suffix '_PAGES'.
- eid1234_example-batch-adding-pages (Top level directory)
  - my_pages_to_add_PAGES
    - firstpagefilename.tiff
    - secondpagefilename.tiff
    - firstpagefilename.xml
    - secondpagefilename.xml
    - firstpageocrfilename.txt
    - secondpageocrfilename.txt
    - yourcustomocr.txt
    - pdffilename.pdf
    - altooutput.xml
    - datastreams.txt
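For larger page sets, the host and PAGE lines of datastreams.txt can be generated rather than typed. The following is a minimal sketch under stated assumptions: `build_pages_manifest` is a hypothetical helper, page files are numbered in the order given starting at `first_page`, and the padding width is a free choice.

```python
HOST_ARGUMENTS = ("HOSTPUBLICATION", "HOSTISSUE", "HOSTBOOK")


def build_pages_manifest(host_argument, pid, page_files, first_page=1, width=3):
    """Return manifest text with one host line and one PAGE line per file."""
    if host_argument not in HOST_ARGUMENTS:
        raise ValueError(f"unknown host argument: {host_argument}")
    lines = [f"{host_argument}=={pid}"]
    for offset, filename in enumerate(page_files):
        # Zero-pad the page number to the chosen width, e.g. PAGE012.
        lines.append(f"PAGE{first_page + offset:0{width}d}=={filename}")
    return "\n".join(lines) + "\n"


print(build_pages_manifest(
    "HOSTBOOK",
    "e0026f7d-9a79-4a8d-8a83-efc153c6a449",
    ["firstpagefilename.tiff", "secondpagefilename.tiff"],
    first_page=12,
))
```

Setting `first_page` to the first unused page number of the existing book keeps the new pages from colliding with pages that are already ingested. Lines for MODS, OCR, or ALTO datastreams still have to be added by hand.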