Table of Contents

Batch ingest for paged resources is different enough that it has its own dedicated space. This page is meant to supplement the more comprehensive batch ingest instructions found here.

Note

When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:

  • Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
  • Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
Info
titleBatch Constraints
  • If you provide your own PDF datastream, there is no hard upper limit on the number of pages, though for now we recommend staying under 500 pages per batch submission.
  • If you are not supplying your own PDF datastream, please keep the number of pages to 10 or fewer. If you need to ingest a larger object, either provide your own PDF or split the resource into multiple batch submissions.

Directory Structure for Batch Ingest of Paged Content

For batch ingest, your assets need to be organized in a top level directory, which you will reference in the batch submission form in the DAMS GUI. The top level directory contains one or more directories holding your asset(s), each named with a suffix denoting its resource type and each containing a manifest (datastreams.txt) and the content.

Note

The batch ingest process expects the assets to be staged in a folder structure as described here. For instance, you need to place each xyz_BOOK/_PUBLICATION/_ISSUE/_PAGES folder inside a top level directory for the batch job. The batch process will not complete otherwise.

Panel
titleExample Batch Submission
  • eid1234_example-batch-submission (Top level directory)
    • grapes_of_wrath_BOOK
      • datastreams.txt 
      • modsfile.xml
      • page1.tiff
      • page1mods.xml
    • wall_street_journal_PUBLICATION
      • datastreams.txt*

      • modsfile.xml
      • jan_2016_ISSUE
        • datastreams.txt
        • page1.tiff
        • page1_mods.xml
    • feb_2016_ISSUE
      • datastreams.txt
      • page1.tiff
      • page1_mods.xml
    • new_york_times_PAGES
      • datastreams.txt
      • page1.tiff
      • page1_mods.xml

*Note that if you are not including MODS at the publication level, you still need a blank manifest at this level.

...

Note: All directories need to have one of the following suffixes to indicate to the system what that directory content is for.

  • _BOOK
  • _PAGES 
  • _ISSUE
  • _PUBLICATION

...

The manifest, a file named 'datastreams.txt', defines what will be included in the resource you are ingesting.

Manifest Arguments

Tip
titleHelpful Hints
  • The number of 0's for padding is up to you.
  • If you are adding your own OCR datastream, then you will need to append "_CUSTOM" to the end.
  • The filename of other datastreams does not need to match the page filename.
  • If you are ingesting your own PDF along with pages, set PDF to the PDF file name so that the system will not create an additional PDF datastream.

The available manifest arguments are:

...

actual page content

...

Sample manifest (datastreams.txt) file for a book and/or publication:

Panel
borderColorgrey
titleSample manifest (datastreams.txt)

MODS==modsfilename.xml

TN==thumbnailfilename.jpg

OTHERDATASTREAM==filename.ext

Adding Custom Datastreams

You can provide custom datastreams using your own naming convention. Do not use any of the Restricted Datastream IDs, and add the suffix '_CUSTOM' to your Datastream ID.

MODS System Generated Fields 

  • keydate: when batch ingesting paged content, add the keyGen="yes" attribute to the appropriate MODS element (dateCreated or dateIssued) to indicate which date is the keydate. If there is no keyGen attribute, issues will appear to be missing from your publication. To find issues with missing keydates, go to the publication > Manage > Publication. To fix the problem, edit the MODS for the affected issue.
  • recordCreationDate
  • identifier utldamspid, utldamsuri, filename
  • relatedItem UTLDAMS digital collection and subelements

...

During a batch ingest of paged content, the specified OCR language must be supported in order for OCR to run. For further documentation about text extraction see here. Set the language to its three-letter language code. Currently supported languages and their codes are listed here. If a language is specified and validated as supported, OCR will be run on the child pages. To specify a language, add the following line to the manifest:

Panel

LANG==[language code]

Info

If a language other than one of the enabled ones mentioned above is added to the manifest, the object will still be ingested but OCR extraction will not occur.

...

Warning

The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book.

Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers.

  1. Add the appropriate argument from the following list to your manifest. The value after the == is the book/publication PID without the namespace (e.g. utblac, utlarch).
    • HOSTPUBLICATION
    • HOSTBOOK
    • HOSTISSUE
  2. When you submit the batch request, the PID of the sub-collection is ignored.

Sample manifest (datastreams.txt) file when ingesting page(s) and/or issue(s).

Panel
borderColorgrey
titleSample manifest (datastreams.txt)

HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449

PAGE001==firstpagefilename.tiff

PAGE002==secondpagefilename.tiff

PAGE001_MODS==firstpagefilename.xml

PAGE002_MODS==secondpagefilename.xml

PAGE001_OCR_CUSTOM==firstpageocrfilename.txt (note: must be text file to get properly indexed)

PAGE002_OCR_CUSTOM==secondpageocrfilename.txt

FULL_TEXT_CUSTOM==yourcustomocr.txt

PDF==pdffilename.pdf (optional)

PAGE001_ALTO==altooutput.xml (optional)

3. When adding pages, the directory containing the pages to add should have the suffix '_PAGES'.

...

borderColorblack
borderStylesolid
titleSample Directory Structure

...


What this does

The batch ingest process for complex assets, or paged content, allows you to batch-create DAMS assets that consist of component parts, typically pages of a book/serial issue/archival file/etc. See pages Anatomy of DAMS digital assets and Content models for details on the structure of digital assets.

Assets created with this batch ingest method will consist of a set of page-level assets, a book/publication issue-level asset, and optionally a publication (series)-level asset. You can also use this method to add publication issues to an existing publication series, and you can add pages to existing book-level or issue-level assets.

Typically, a book-level asset and its pages are created using image files that are the result of a scanning process (digitally reformatted content). This batch ingest method also allows you to use a PDF document as a source file, in which case the DAMS software will automatically create an image file for each page of the submitted PDF. For digitally reformatted (scanned) content, using a PDF is strongly discouraged, as the automatically created page images are almost invariably of lower quality than the original scan images. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).

For born-digital content (for instance modern PDF ebooks or PDF documents directly exported from a word processor), other content models and ingest processes will be more appropriate. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).

Multiexcerpt include
MultiExcerptNameBatch ingest general instructions
PageWithExcerptBatch ingest simple assets

Staging files for batch ingest of paged content

For batch ingest, your assets MUST be organized within a top level directory which represents the batch job. You will reference the batch job directory in the batch submission form in the DAMS GUI. Each batch job directory MUST contain one or more subdirectories, each representing a book/publication issue, a publication/series, or a set of pages that should be added to an existing paged content asset. You can combine multiple books, issues, and page addition sets in a batch job, as long as the job does not exceed the file and job size limitations.

Creating a datastreams.txt manifest file

Subdirectories in the batch job folder MUST each contain a manifest file named datastreams.txt. The manifest file specifies the intended structure of the DAMS asset, for instance specifying the order of page images, pointing to the MODS XML containing the metadata for the asset or a thumbnail image to be ingested as the TN datastream. The manifest file can also designate the language of the content, to instruct the DAMS to perform Optical Character Recognition (OCR) on the ingested pages.

Each line of the manifest file contains an argument-value pair in the following format:

<ARGUMENT>==<VALUE> 

Use 2 (two) equal signs to separate arguments and values.

Code Block
titleSample datastreams.txt manifest file for a book or issue-level asset
MODS==modsfilename.xml
TN==thumbnailfilename.jpg
PAGE001==filename001.tif
PAGE002==filename002.tif
PAGE003==filename003.tif
LANG==eng
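
The argument-value format above is simple enough to check before staging. The following Python sketch is not part of the DAMS software; the function name parse_manifest is hypothetical, and skipping blank lines is an assumption rather than confirmed DAMS behavior. It splits each line on the first double equal sign and rejects lines that do not follow the <ARGUMENT>==<VALUE> pattern.

```python
def parse_manifest(text):
    """Parse datastreams.txt content into a list of (argument, value) pairs.

    Hypothetical helper for pre-flight checks; not part of the DAMS software.
    """
    pairs = []
    for line_number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines are harmless and can be skipped
        # Split on the first "==" only, so values may themselves contain "=".
        argument, separator, value = line.partition("==")
        if not separator or not argument.strip() or not value.strip():
            raise ValueError(
                f"line {line_number}: expected <ARGUMENT>==<VALUE>, got {raw_line!r}"
            )
        pairs.append((argument.strip(), value.strip()))
    return pairs


sample = """MODS==modsfilename.xml
TN==thumbnailfilename.jpg
PAGE001==filename001.tif
LANG==eng
"""
for argument, value in parse_manifest(sample):
    print(argument, "->", value)
```

A line written with a single equal sign, such as MODS=modsfilename.xml, raises a ValueError in this sketch, which makes the most common formatting slip visible before the job is submitted.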

Manifest Arguments

The available manifest arguments are:

<ARGUMENT> | Value Associated | Purpose | Accepted File Types | Additional Notes
MODS | MODS XML file name | provide MODS metadata for an asset | xml | Can be used for publication/series-level assets, book and issue-level assets.
TN | thumbnail image file name | provide a thumbnail picture for an asset | png, jpg, jpeg

Can be used for publication/series-level assets, book and issue-level assets.

If no thumbnail is provided during batch ingest, the DAMS will copy the thumbnail image of the first page of the asset to the book/issue level asset.

LANG | three-letter language code | instruct the DAMS software to perform OCR for each page | N/A

Can be used for book/issue-level assets.

See page _Text extraction in DAMS for the list of languages for which the DAMS software supports OCR processing.

Note

The OCR software built into the DAMS provides unoptimized recognition results for a limited set of supported languages. Consult with Digitization Services about the external OCR software available for processing, which will yield better recognition results.


Info

If you specify a language not supported by the DAMS software, the asset will still be ingested but no OCR extraction will be performed.


PAGE<NUMBER> | name of the file with the page image | provide page content, in sequential order | tiff, tif, jp2

Can be used for book/issue-level assets.

Replace <NUMBER> with a number for each page that indicates the page's sequential order, for example:

Code Block
PAGE001==filename001.tif
PAGE002==filename002.tif
(etc.)

Pad the number with zeroes. The number of zeroes for padding is up to you.

PAGE<NUMBER>_OCR_CUSTOM | name of externally generated OCR file for that page | allows you to provide your own OCR datastream for each page | txt

Can be used for book/issue-level assets.

PAGE<NUMBER>_<CUSTOM_DATASTREAM> | name of additional file | allows you to add custom datastreams to page-level assets | *

Replace <CUSTOM_DATASTREAM> with a datastream label. The label should correspond to one of the recommended datastream types listed on page Anatomy of DAMS digital assets.

If you wish to ingest additional files that do not match any of the listed datastream types, please contact the DAMS managers for consultation (click here to submit a DAMS service request).

Warning

DO NOT use any of the Restricted Datastream IDs.

DO NOT use any of the system-generated datastream labels to ingest additional files, as they may be overwritten by the DAMS software.


FULL_TEXT_CUSTOM | name of text file with externally created full text (text extracted from PDF) | allows you to provide your own FULL_TEXT datastream for a book/issue | txt

Can be used for book/issue-level assets.

Note

Use only for assets where the primary source file is a PDF document and for full text produced with pdftotext. See page _Text extraction in DAMS for details on the different text extraction/recognition methods.


PDF | name of your PDF file | PDF for resource | pdf

Can be used for book/issue-level assets.

Use to add an externally created PDF document to an asset.

Info

If no page images are specified in the manifest, the DAMS will render image files from the pages of the PDF document and use these images to create page-level assets.

For digitally reformatted (scanned) content, using a PDF as a source for creating page images is strongly discouraged, as the automatically created page images are almost invariably of lower quality than the original scan images. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).

For born-digital content (for instance modern PDF ebooks or PDF documents directly exported from a word processor), other content models and ingest processes will be more appropriate. Contact the DAMS managers for a consultation (click here to submit a DAMS service request).


HOSTPUBLICATION | PID without namespace ID | Add issue(s) to publication | text

Can be used for book/issue-level assets.

Use to specify which publication/series-level asset an issue should be added to.

PID without namespace ID is the part of a PID after the colon (UUID), e.g. 9ebf6ac8-1823-4bf4-8398-654b54090776 for PID utlarch:9ebf6ac8-1823-4bf4-8398-654b54090776.

HOSTISSUE | PID without namespace ID | Add pages to an issue | text

Can be used with sets of page images.

Use to specify which issue-level asset a set of page images should be added to.

PID without namespace ID is the part of a PID after the colon (UUID), e.g. 9ebf6ac8-1823-4bf4-8398-654b54090776 for PID utlarch:9ebf6ac8-1823-4bf4-8398-654b54090776.

HOSTBOOK | PID without namespace ID | Add pages to a book | text

Can be used with sets of page images.

Use to specify which book-level asset a set of page images should be added to.

PID without namespace ID is the part of a PID after the colon (UUID), e.g. 9ebf6ac8-1823-4bf4-8398-654b54090776 for PID utlarch:9ebf6ac8-1823-4bf4-8398-654b54090776.
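
The HOSTPUBLICATION, HOSTISSUE and HOSTBOOK values all need the PID with its namespace removed. As a small illustration, the following Python sketch takes a full PID and returns the part after the colon. The function name pid_without_namespace is hypothetical and the sketch only checks that a colon-separated namespace is present; it does not verify that the asset exists in the DAMS.

```python
def pid_without_namespace(pid):
    """Return the part of a PID after the colon, e.g. the UUID of 'utlarch:<uuid>'.

    Hypothetical helper; it does not check that the PID exists in the DAMS.
    """
    namespace, separator, identifier = pid.strip().partition(":")
    if not separator or not namespace or not identifier:
        raise ValueError(f"expected <namespace>:<identifier>, got {pid!r}")
    return identifier


full_pid = "utlarch:9ebf6ac8-1823-4bf4-8398-654b54090776"
# Build the manifest line for adding pages to an existing book.
print("HOSTBOOK==" + pid_without_namespace(full_pid))
# prints HOSTBOOK==9ebf6ac8-1823-4bf4-8398-654b54090776
```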

Folder naming conventions and folder hierarchy

Subdirectories inside of the batch job folder MUST contain one of the following suffixes as part of their name, to denote the content type or type of batch ingest:

  • _PUBLICATION
  • _ISSUE
  • _BOOK
  • _PAGES
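
Before uploading, the folder rules can be checked locally. The Python sketch below is an illustration only, not a DAMS tool. It assumes that every directory below the batch job folder, including nested _ISSUE folders, must end in one of the four suffixes and must contain a datastreams.txt file. The function name check_batch_folder is hypothetical.

```python
from pathlib import Path

# Suffixes listed above; every subdirectory name is expected to end in one of them.
REQUIRED_SUFFIXES = ("_PUBLICATION", "_ISSUE", "_BOOK", "_PAGES")


def check_batch_folder(batch_dir):
    """Return a list of human-readable problems found under the batch job folder.

    Hypothetical pre-flight check; an empty list means no problems were found.
    """
    batch_dir = Path(batch_dir)
    problems = []
    # rglob("*") walks nested folders too, so _ISSUE folders inside a
    # _PUBLICATION folder are checked as well.
    subdirectories = sorted(p for p in batch_dir.rglob("*") if p.is_dir())
    if not subdirectories:
        problems.append("batch job folder contains no subdirectories")
    for folder in subdirectories:
        relative = folder.relative_to(batch_dir)
        if not folder.name.endswith(REQUIRED_SUFFIXES):
            problems.append(f"{relative}: folder name lacks a required suffix")
        if not (folder / "datastreams.txt").is_file():
            problems.append(f"{relative}: missing datastreams.txt")
    return problems
```

Calling check_batch_folder("eid1234_example-batch-submission") on a staged job returns an empty list when every folder is named correctly and has a manifest. The check says nothing about the contents of the manifests themselves.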

<foldername>_PUBLICATION

Use to create a publication/series-level asset. These assets are intended to organize publication issues inside the DAMS. The DAMS GUI will show a calendar display to navigate publication issues based on creation/issuance month and year. Publication/series-level assets cannot be published to the Collections Portal and there is currently no calendar display available on the Collections Portal to browse serial issues.

Note

If you are not ingesting data at the publication level, e.g. MODS XML metadata, you still need to add an empty manifest file named datastreams.txt at this folder level.

A <foldername>_PUBLICATION folder can contain multiple folders with the suffix _ISSUE, in order to create the publication/series-level asset and corresponding issue-level assets at the same time.

DO NOT nest _ISSUE folders under a _PUBLICATION folder if you want to add issue-level assets to an existing publication/serial-level asset. Refer to the following section for instructions on how to add issue-level assets to an existing publication/serial-level asset.

Code Block
titleSample datastreams.txt for publication/series-level asset
MODS==modsfile.xml

<foldername>_ISSUE

Use to create an issue-level asset from scanned page images or a PDF.

<foldername>_ISSUE folders can be nested under a folder with the _PUBLICATION suffix, in order to create the publication/series-level asset and corresponding issue-level assets at the same time.

If you want to add an issue-level asset to an existing publication/series-level asset, DO NOT nest _ISSUE folders under a _PUBLICATION folder. Instead, place the _ISSUE folder directly inside the batch job folder and specify the host publication asset in the datastreams.txt manifest like this:
HOSTPUBLICATION==<PID of the publication/series-level asset> 

Note

When submitting a batch ingest job intended to add an issue to an existing publication, you must enter the PID of the subcollection containing the publication/series-level asset. This is a known bug.

When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:

  • Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
  • Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.


Code Block
titleSample datastreams.txt for issue asset
HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449 <-- only needed when adding an issue to an existing publication!
MODS==modsfile.xml
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
LANG==eng
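
The two PDF staging rules from the note above can be checked with a few lines of Python. This sketch is an illustration under stated assumptions: it applies only when you want the DAMS to split the PDF into pages, it treats tif, tiff and jp2 files as page images (the accepted page file types listed in the manifest arguments), and the function name pdf_staging_problems is hypothetical.

```python
from pathlib import Path

# Page image types accepted for PAGE<NUMBER>, per the manifest arguments table.
PAGE_IMAGE_EXTENSIONS = {".tif", ".tiff", ".jp2"}


def pdf_staging_problems(filenames):
    """Check staged file names against the two PDF staging rules.

    Hypothetical helper for the case where the PDF should be split into pages.
    """
    problems = []
    pdfs = [n for n in filenames if Path(n).suffix.lower() == ".pdf"]
    images = [n for n in filenames if Path(n).suffix.lower() in PAGE_IMAGE_EXTENSIONS]
    for name in pdfs:
        if " " in name:
            problems.append(f"{name}: PDF filename contains spaces")
    if pdfs and images:
        problems.append("page images are staged together with a PDF")
    return problems


print(pdf_staging_problems(["annual report.pdf", "modsfile.xml", "page01.tif"]))
```

An empty list means neither rule is violated. Staging a PDF together with page images remains valid when you supply your own page images and do not expect the PDF to be split.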

<foldername>_BOOK

Use to create a book-level asset from scanned page images or a PDF.

When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:

  • Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
  • Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.


Code Block
titleSample datastreams.txt for book asset
MODS==modsfile.xml
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
OCR_CUSTOM==book_level_custom_ocr.txt
PDF==book_level_pdf.pdf
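
For books with many pages, the PAGE<NUMBER> lines can be generated rather than typed. The Python sketch below is one possible approach, not a DAMS utility. It assumes that sorting the file names alphabetically yields the correct page order, which holds when the file names are zero-padded. The function name build_page_lines and its start and width parameters are hypothetical.

```python
def build_page_lines(page_filenames, start=1, width=3):
    """Build PAGE<NUMBER>==<filename> lines with zero-padded page numbers.

    Assumes alphabetical order of the file names is the intended page order.
    """
    lines = []
    for offset, filename in enumerate(sorted(page_filenames)):
        page_number = start + offset
        lines.append(f"PAGE{page_number:0{width}d}=={filename}")
    return lines


manifest_lines = ["MODS==modsfile.xml"] + build_page_lines(["page02.tif", "page01.tif"])
print("\n".join(manifest_lines))
# prints:
# MODS==modsfile.xml
# PAGE001==page01.tif
# PAGE002==page02.tif
```

The start parameter is useful when adding pages to an existing book, where numbering has to continue after the pages already ingested.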

<foldername>_PAGES

Use to add page images and other datastreams to an existing book or issue-level asset.

Place the <foldername>_PAGES folder directly inside the batch job folder and add page images. In the datastreams.txt manifest file, add entries for each page image and specify the page order as part of the PAGE<NUMBER> argument, for example:

Code Block
PAGE001==filename001.tif
PAGE002==filename002.tif
(etc.)


Warning

The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book.

Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers.
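
One way to guard against the overlap described in this warning is to compare the page numbers in the new manifest with the page numbers already present in the book. The Python sketch below assumes you supply the existing page numbers yourself, for example after looking them up in the DAMS GUI, and that page numbers are compared as integers, so PAGE003 and PAGE3 count as the same page. The function name overlapping_page_numbers is hypothetical.

```python
import re

# Matches PAGE<NUMBER> only; PAGE001_OCR_CUSTOM and similar arguments are ignored.
PAGE_ARGUMENT = re.compile(r"^PAGE(\d+)$")


def overlapping_page_numbers(manifest_text, existing_page_numbers):
    """Return the sorted page numbers that the manifest would duplicate.

    existing_page_numbers must be provided by the caller; this sketch does
    not query the DAMS.
    """
    new_numbers = set()
    for line in manifest_text.splitlines():
        argument = line.partition("==")[0].strip()
        match = PAGE_ARGUMENT.match(argument)
        if match:
            new_numbers.add(int(match.group(1)))
    return sorted(new_numbers & set(existing_page_numbers))


manifest = """HOSTBOOK==e0026f7d-9a79-4a8d-8a83-efc153c6a449
PAGE003==page03.tif
PAGE004==page04.tif
PAGE003_OCR_CUSTOM==page03_custom_ocr.txt
"""
print(overlapping_page_numbers(manifest, existing_page_numbers=[1, 2, 3]))
# prints [3]
```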


Code Block
titleSample datastreams.txt for adding pages to existing book
HOSTBOOK==e0026f7d-9a79-4a8d-8a83-efc153c6a449
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
OCR_CUSTOM==issue_level_custom_ocr.txt
PDF==issue_level_pdf.pdf

Sample batch folder structure

Code Block
eid1234_example-batch-submission/ (batch job folder)
├── grapes_of_wrath_BOOK/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── book_level_custom_ocr.txt
│   ├── book_level_pdf.pdf
│   ├── page01.tif
│   └── page02.tif
├── wall_street_journal_PUBLICATION/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── wsj_jan_2016_ISSUE/
│   │   ├── datastreams.txt
│   │   ├── modsfile.xml
│   │   ├── page01.tif
│   │   └── page02.tif
│   └── wsj_feb_2016_ISSUE/
│       ├── datastreams.txt
│       ├── modsfile.xml
│       ├── page01.tif
│       └── page02.tif
├── ascii_art_monthly_july_2021_ISSUE/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── page01.tif
│   ├── page01_custom_ocr.txt
│   ├── page02.tif
│   └── page02_custom_ocr.txt
└── nyt_2020-11-04_PAGES/
    ├── datastreams.txt
    ├── issue_level_custom_ocr.txt
    ├── issue_level_pdf.pdf
    ├── page01.tif
    ├── page01_custom_ocr.txt
    ├── page02.tif
    └── page02_custom_ocr.txt

Step 2: Upload batch job to Jscape

Multiexcerpt include
MultiExcerptNameBatch ingest upload
PageWithExcerptBatch ingest simple assets

Step 3: Set up collection and submit form in DAMS interface

Multiexcerpt include
MultiExcerptNamebatch ingest queue
PageWithExcerptBatch ingest simple assets