What this does
The batch ingest process for complex assets, or paged content, allows you to batch-create DAMS assets that consist of component parts, typically the pages of a book, serial issue, or archival file. See the pages Anatomy of DAMS digital assets and Content models for details on the structure of digital assets.
...
Staging files for batch ingest of paged content
For batch ingest, your assets MUST be organized within a top-level directory that represents the batch job. You will reference the batch job directory in the batch submission form in the DAMS GUI. Each batch job directory MUST contain one or more subdirectories, each representing a book or publication issue, a publication/series, or a set of pages to be added to an existing paged content asset. You can combine multiple books, issues, and page addition sets in one batch job, as long as the job does not exceed the file and job size limitations.
Creating a datastreams.txt manifest file
Subdirectories in the batch job folder MUST each contain a manifest file named datastreams.txt. The manifest file specifies the intended structure of the DAMS asset: for instance, it sets the order of page images and points to the MODS XML file containing the asset's metadata or to a thumbnail image to be ingested as the TN datastream. The manifest file can also designate the language of the content, which instructs the DAMS to perform Optical Character Recognition (OCR) on the ingested pages.
...
Code Block | ||
---|---|---|
| ||
MODS==modsfilename.xml
TN==thumbnailfilename.jpg
PAGE001==filename001.tif
PAGE002==filename002.tif
PAGE003==filename003.tif
LANG==eng |
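To make the ARGUMENT==value layout concrete, here is a minimal Python sketch that reads manifest text into ordered pairs. It is not part of the DAMS software. The function name `parse_manifest` is hypothetical, and skipping blank lines is an assumption rather than a documented rule.

```python
def parse_manifest(text):
    """Parse datastreams.txt content into an ordered list of (argument, value) pairs."""
    entries = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines are harmless and can be skipped
        key, sep, value = line.partition("==")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"line {line_number}: expected ARGUMENT==value, got {raw!r}")
        entries.append((key.strip(), value.strip()))
    return entries


sample = """MODS==modsfilename.xml
TN==thumbnailfilename.jpg
PAGE001==filename001.tif
LANG==eng
"""

for argument, value in parse_manifest(sample):
    print(argument, "->", value)
```

Running a check like this before staging can catch lines that are missing the `==` separator.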
Manifest Arguments
The available manifest arguments are:
<ARGUMENT> | Value Associated | Purpose | Accepted File Types | Additional Notes |
---|---|---|---|---|
MODS | MODS XML file name | provide MODS metadata for an asset | xml | Can be used for publication/series-level assets, book and issue-level assets. |
TN | thumbnail image file name | provide a thumbnail picture for an asset | png, jpg, jpeg | Can be used for publication/series-level assets, book and issue-level assets. If no thumbnail is provided during batch ingest, the DAMS will copy the thumbnail image of the first page of the asset to the book/issue-level asset. |
LANG | three-letter language code | instruct the DAMS software to perform OCR for each page | N/A | Can be used for book/issue-level assets. See the page Text extraction in DAMS for the list of languages for which the DAMS software supports OCR processing. |
PAGE<NUMBER> | name of the file with the page image | provide page content, in sequential order | tiff, tif, jp2 | Can be used for book/issue-level assets. Replace <NUMBER> with a number for each page that indicates the page's sequential order, for example PAGE001, PAGE002. Pad the number with zeroes. The number of zeroes for padding is up to you. |
PAGE<NUMBER>_OCR_CUSTOM | name of externally generated OCR file for that page | allows you to provide your own OCR datastream for each page | txt | Can be used for book/issue-level assets. |
PAGE<NUMBER>_<CUSTOM_DATASTREAM> | name of additional file | allows you to add custom datastreams to page-level assets | * | Replace <CUSTOM_DATASTREAM> with a datastream label. The label should correspond to one of the recommended datastream types listed on the page Anatomy of DAMS digital assets. If you wish to ingest additional files that do not match any of the listed datastream types, please submit a DAMS service request to consult the DAMS managers. |
FULL_TEXT_CUSTOM | name of text file with externally created full text (text extracted from PDF) | allows you to provide your own FULL_TEXT datastream for a book/issue | txt | Can be used for book/issue-level assets. |
PDF | name of your pdf file | PDF for resource | pdf | Can be used for book/issue-level assets. Use to add an externally created PDF document to an asset. |
HOSTPUBLICATION | PID without namespace ID | Add issue(s) to publication | text | Can be used for book/issue-level assets. Use to specify which publication/series-level asset an issue should be added to. PID without namespace ID is the part of a PID after the colon (UUID). |
HOSTISSUE | PID without namespace ID | Add pages to an issue | text | Can be used with sets of page images. Use to specify which issue-level asset a set of page images should be added to. PID without namespace ID is the part of a PID after the colon (UUID). |
HOSTBOOK | PID without namespace ID | Add pages to a book | text | Can be used with sets of page images. Use to specify which book-level asset a set of page images should be added to. PID without namespace ID is the part of a PID after the colon (UUID). |
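The accepted file types in the table lend themselves to a quick pre-flight check. The Python sketch below is one possible way to do that and is not a DAMS tool. The function name `extension_problem` is hypothetical, and treating extensions as case-insensitive is an assumption. Arguments without a file type rule in this sketch (LANG, the HOST arguments, custom datastreams) are passed through unchecked.

```python
import re
from pathlib import PurePath

# Extensions copied from the "Accepted File Types" column of the table.
ACCEPTED = {
    "MODS": {"xml"},
    "TN": {"png", "jpg", "jpeg"},
    "FULL_TEXT_CUSTOM": {"txt"},
}
PAGE_IMAGE = re.compile(r"^PAGE\d+$")
PAGE_OCR = re.compile(r"^PAGE\d+_OCR_CUSTOM$")


def extension_problem(argument, filename):
    """Return a message if the extension is not listed for the argument, else None."""
    # Assumption: extensions are compared case-insensitively.
    ext = PurePath(filename).suffix.lower().lstrip(".")
    if argument in ACCEPTED:
        allowed = ACCEPTED[argument]
    elif PAGE_IMAGE.match(argument):
        allowed = {"tiff", "tif", "jp2"}
    elif PAGE_OCR.match(argument):
        allowed = {"txt"}
    else:
        return None  # no extension rule is checked for this argument
    if ext not in allowed:
        return f"{argument}: .{ext or '(none)'} not in {sorted(allowed)}"
    return None


for argument, filename in [("TN", "cover.gif"), ("PAGE001", "page01.tif")]:
    print(argument, filename, extension_problem(argument, filename))
```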
Folder naming conventions and folder hierarchy
Subdirectories inside of the batch job folder MUST contain one of the following suffixes as part of their name, to denote the content type or type of batch ingest:
- _PUBLICATION
- _ISSUE
- _BOOK
- _PAGES
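A short Python sketch of how the suffix rule could be checked before submission. The helper name `folder_type` is hypothetical. Requiring a non-empty name in front of the suffix is an assumption based on the `<foldername>_SUFFIX` pattern used in this page.

```python
SUFFIXES = ("_PUBLICATION", "_ISSUE", "_BOOK", "_PAGES")


def folder_type(name):
    """Return PUBLICATION, ISSUE, BOOK or PAGES for a valid folder name, else None."""
    for suffix in SUFFIXES:
        # Assumption: a bare suffix such as "_BOOK" without a name part is rejected.
        if name.endswith(suffix) and len(name) > len(suffix):
            return suffix[1:]
    return None


for name in ["grapes_of_wrath_BOOK", "nyt_2020-11-04_PAGES", "scans"]:
    print(name, "->", folder_type(name))
```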
<foldername>_PUBLICATION
Use to create a publication/series-level asset. These assets are intended to organize publication issues inside the DAMS. The DAMS GUI will show a calendar display to navigate publication issues based on creation/issuance month and year. Publication/series-level assets cannot be published to the Collections Portal and there is currently no calendar display available on the Collections Portal to browse serial issues.
...
DO NOT nest _ISSUE folders under a _PUBLICATION folder if you want to add issue-level assets to an existing publication/series-level asset. Refer to the following section for instructions on how to add issue-level assets to an existing publication/series-level asset.
Code Block | ||
---|---|---|
| ||
MODS==modsfile.xml |
<foldername>_ISSUE
Use to create an issue-level asset from scanned page images or a PDF.
...
- Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
- Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
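The two staging rules above can be checked with a small script before you submit a batch. This is a sketch only and not part of the DAMS. The function name `pdf_staging_problems` is hypothetical, and the list of page image extensions is taken from the manifest arguments table. The mixed-content warning only matters when you expect the DAMS to split the PDF into pages.

```python
import tempfile
from pathlib import Path

PAGE_IMAGE_EXTENSIONS = {".tif", ".tiff", ".jp2"}


def pdf_staging_problems(folder):
    """List reasons the DAMS might not split a staged PDF into page images."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    pdfs = sorted(p.name for p in files if p.suffix.lower() == ".pdf")
    has_images = any(p.suffix.lower() in PAGE_IMAGE_EXTENSIONS for p in files)
    problems = [f"space in PDF filename: {name}" for name in pdfs if " " in name]
    if pdfs and has_images:
        problems.append("PDF staged together with page images")
    return problems


# Demonstration on a throwaway folder with one badly named PDF and one page image.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "annual report.pdf").touch()
    Path(tmp, "page01.tif").touch()
    for problem in pdf_staging_problems(tmp):
        print(problem)
```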
...
Code Block | ||
---|---|---|
| ||
HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449 <-- only needed when adding an issue to an existing publication!
MODS==modsfile.xml
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
LANG==eng
|
<foldername>_BOOK
Use to create a book-level asset from scanned page images or a PDF.
When you are ingesting PDF files as paged content with the intention to have the document split up into individual pages, prepare your source files as follows:
- Stage only the PDF file and applicable metadata or derivative datastreams. Do not stage page images along with the PDF. The DAMS software will not attempt to split a PDF file into page images if you ingest a PDF document together with page image files.
- Remove spaces from PDF filenames. If you stage a PDF file with a name containing spaces for ingest, the DAMS software will not create page images.
Code Block | ||
---|---|---|
| ||
MODS==modsfile.xml
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
OCR_CUSTOM==book_level_custom_ocr.txt
PDF==book_level_pdf.pdf |
<foldername>_PAGES
Use to add page images and other datastreams to an existing book or issue-level asset.
...
Warning |
---|
The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book. Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers. |
Code Block | ||
---|---|---|
| ||
HOSTBOOK==e0026f7d-9a79-4a8d-8a83-efc153c6a449
PAGE001==page01.tif
PAGE002==page02.tif
PAGE001_OCR_CUSTOM==page01_custom_ocr.txt
PAGE002_OCR_CUSTOM==page02_custom_ocr.txt
OCR_CUSTOM==issue_level_custom_ocr.txt
PDF==issue_level_pdf.pdf |
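To avoid the duplicate page numbers described in the warning above, you can compare the page numbers in a _PAGES manifest with those already present in the host asset. The Python sketch below assumes you have looked up the existing page numbers yourself (for example in the DAMS GUI) and pass them in as a list. The function names are hypothetical.

```python
import re

PAGE_KEY = re.compile(r"^PAGE(\d+)$")


def manifest_page_numbers(manifest_text):
    """Collect the numbers of all PAGE<NUMBER> entries, ignoring other arguments."""
    numbers = []
    for line in manifest_text.splitlines():
        key, sep, _ = line.strip().partition("==")
        match = PAGE_KEY.match(key)
        if sep and match:
            numbers.append(int(match.group(1)))
    return numbers


def overlapping_pages(existing_numbers, manifest_text):
    """Return page numbers that the manifest would assign a second time."""
    return sorted(set(existing_numbers) & set(manifest_page_numbers(manifest_text)))


manifest = """HOSTBOOK==e0026f7d-9a79-4a8d-8a83-efc153c6a449
PAGE003==page03.tif
PAGE004==page04.tif
"""

# Assumption for this example: the existing book already holds pages 1 to 3.
print(overlapping_pages([1, 2, 3], manifest))
```

An empty result means the new pages do not collide with the numbers you supplied.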
Sample batch folder structure
Code Block |
---|
eid1234_example-batch-submission/ (batch job folder)
├── grapes_of_wrath_BOOK/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── book_level_custom_ocr.txt
│   ├── book_level_pdf.pdf
│   ├── page01.tif
│   └── page02.tif
├── wall_street_journal_PUBLICATION/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── wsj_jan_2016_ISSUE/
│   │   ├── datastreams.txt
│   │   ├── modsfile.xml
│   │   ├── page01.tif
│   │   └── page02.tif
│   └── wsj_feb_2016_ISSUE/
│       ├── datastreams.txt
│       ├── modsfile.xml
│       ├── page01.tif
│       └── page02.tif
├── ascii_art_monthly_july_2021_ISSUE/
│   ├── datastreams.txt
│   ├── modsfile.xml
│   ├── page01.tif
│   ├── page01_custom_ocr.txt
│   ├── page02.tif
│   └── page02_custom_ocr.txt
└── nyt_2020-11-04_PAGES/
    ├── datastreams.txt
    ├── issue_level_custom_ocr.txt
    ├── issue_level_pdf.pdf
    ├── page01.tif
    ├── page01_custom_ocr.txt
    ├── page02.tif
    └── page02_custom_ocr.txt |
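Every folder in the sample structure carries its own datastreams.txt, including _ISSUE folders nested under a _PUBLICATION folder. The Python sketch below walks a batch job folder and reports folders without a manifest. It is an illustration only: the function name `folders_missing_manifest` is hypothetical, and checking nested folders at every depth is an assumption drawn from the sample structure.

```python
import tempfile
from pathlib import Path


def folders_missing_manifest(batch_dir):
    """Return relative paths of all folders below batch_dir that lack datastreams.txt."""
    batch = Path(batch_dir)
    missing = []
    for folder in sorted(p for p in batch.rglob("*") if p.is_dir()):
        if not (folder / "datastreams.txt").is_file():
            missing.append(folder.relative_to(batch).as_posix())
    return missing


# Demonstration on a throwaway batch folder where one nested issue has no manifest.
with tempfile.TemporaryDirectory() as tmp:
    book = Path(tmp, "grapes_of_wrath_BOOK")
    issue = Path(tmp, "wall_street_journal_PUBLICATION", "wsj_jan_2016_ISSUE")
    book.mkdir()
    issue.mkdir(parents=True)
    (book / "datastreams.txt").touch()
    (issue.parent / "datastreams.txt").touch()
    print(folders_missing_manifest(tmp))
```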
...