Note |
---|
When you are ingesting PDF files as paged content and want the document split into individual pages, prepare your source files as follows: |
Directory Structure for Batch Ingest of Paged Content
...
Argument | Value Associated | Purpose | Accepted Data Type | Additional Notes |
---|---|---|---|---|
MODS | name of your mods file | provide MODS for your resource | xml | |
TN | thumbnail picture | provide thumbnail picture for your resource | png, jpg, jpeg | |
LANG | one of the supported language codes here | To have OCR created for each page | text | |
PAGE + NUMBER | name of the file with the page object | actual page content | tiff, tif, jp2, jpg, jpeg | |
PAGE + OCR_CUSTOM | name of your ocr file for that page | allows you to provide your own OCR | textfile | |
FULL_TEXT_CUSTOM | name of your full text for your pdf file | allows you to provide your own FULL_TEXT for the pdf | textfile | |
PDF | name of your pdf file | PDF for the resource; it will be cut up into individual pages if individual pages are not provided | pdf | Do not provide PDFs at the page level. Only provide them at the book/issue level. If you do provide them, we cannot guarantee they will not be overwritten by inferior-quality system-generated PDFs. |
HOSTPUBLICATION | pid | Add issue(s) to publication | text | |
HOSTISSUE | pid | Add pages to an issue | text | |
HOSTBOOK | pid | Add pages to a book | text | |
...
Sample manifest (datastreams.txt) file for a book and/or publication:
Panel |
---|
MODS==modsfilename.xml |
TN==thumbnailfilename.jpg |
OTHERDATASTREAM==filename.ext |
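
To make the manifest format concrete, here is a minimal Python sketch that reads `KEY==value` entries into a dictionary. It assumes one entry per line, and the function name `parse_manifest` is hypothetical; it is not part of the ingest system. Rejecting duplicate keys is a choice made for this sketch, not documented system behavior.

```python
def parse_manifest(text):
    """Parse 'KEY==value' lines into a dict, skipping blank lines.

    Assumption: one entry per line. Duplicate keys are rejected here
    as a sketch-level safety check, not as documented system behavior.
    """
    entries = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition("==")
        key, value = key.strip(), value.strip()
        if not sep or not key or not value:
            raise ValueError(f"line {lineno}: expected KEY==value, got {line!r}")
        if key in entries:
            raise ValueError(f"line {lineno}: duplicate key {key!r}")
        entries[key] = value
    return entries


sample = """\
MODS==modsfilename.xml
TN==thumbnailfilename.jpg
OTHERDATASTREAM==filename.ext
"""
manifest = parse_manifest(sample)
print(manifest["MODS"])
```

Running a check like this before submitting a batch can catch a malformed line (for example a single `=`) early.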
Adding Custom Datastreams
You can provide custom datastreams with your own naming convention. Do not use any of the Restricted Datastream IDs, and add the suffix '_CUSTOM' to your Datastream ID.
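The naming rule can be checked before ingest with a short Python sketch. The `RESTRICTED_IDS` set below is a placeholder only, since this page does not list the Restricted Datastream IDs; substitute the real list. The function name `custom_id_problems` is likewise hypothetical.

```python
# Placeholder values only: replace with the real Restricted Datastream IDs.
RESTRICTED_IDS = frozenset({"MODS", "TN", "PDF"})


def custom_id_problems(dsid, restricted=RESTRICTED_IDS):
    """Return a list of naming problems for a custom datastream ID."""
    problems = []
    if not dsid.endswith("_CUSTOM"):
        problems.append("missing '_CUSTOM' suffix")
    if dsid in restricted:
        problems.append("uses a restricted datastream ID")
    return problems


for dsid in ("TRANSCRIPT_CUSTOM", "TRANSCRIPT", "MODS"):
    print(dsid, custom_id_problems(dsid))
```

An empty list means the ID passes both checks in this sketch.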
MODS System Generated Fields
- keydate: when batch ingesting paged content, add the keyDate="yes" attribute to the appropriate MODS element (dateCreated or dateIssued) to indicate which date is the keydate. If no element carries this attribute, issues will appear to be missing from your Publication. To find issues with missing keydates, go to the publication > Manage > Publication. Fix the problem by editing the MODS of the affected issue.
- recordCreationDate
- identifier utldamspid, utldamsuri, filename
- relatedItem UTLDAMS digital collection and subelements
OCR Selection
During a batch ingest of Paged Content, the specified OCR language must be supported in order for OCR to run. For further documentation about text extraction see here. Set the language to its three-letter language code. Currently supported languages and their codes are listed here. If a language is specified and validated as supported, OCR will be run on the child pages. To specify a language, add the following line to the manifest:
Panel |
---|
LANG==[language code] |
Info |
---|
If a language other than one of the enabled ones mentioned above is added to the manifest, the object will still be ingested but OCR extraction will not occur. |
...
Warning |
---|
The ingest process will result in duplicate page numbers if the manifest assigns newly ingested images a page number that already exists in a partially ingested book. Make sure to carefully assign page numbers to page images that are to be added to an existing book and avoid overlap with existing page numbers. |
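
One way to catch overlapping page numbers before submitting a batch is sketched below in Python. It assumes page keys have the form `PAGE` followed by digits, as in the sample manifests, and that you already know which page numbers exist in the book. The sketch does not query the repository, and the function names are hypothetical.

```python
import re

# Matches page-object keys such as PAGE001, but not PAGE001_MODS or PAGE001_OCR_CUSTOM.
PAGE_KEY = re.compile(r"^PAGE(\d+)$")


def manifest_page_numbers(keys):
    """Collect the page numbers assigned by PAGE<number> manifest keys."""
    numbers = set()
    for key in keys:
        match = PAGE_KEY.match(key)
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def overlapping_pages(manifest_keys, existing_pages):
    """Return the page numbers that the manifest would assign a second time."""
    return sorted(manifest_page_numbers(manifest_keys) & set(existing_pages))


keys = ["HOSTBOOK", "PAGE003", "PAGE004", "PAGE004_OCR_CUSTOM"]
print(overlapping_pages(keys, existing_pages=[1, 2, 3]))  # [3]
```

An empty result means none of the new page numbers collide with the ones supplied in `existing_pages`.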
- Place the appropriate one of the following arguments in your manifest; the value after the == is the book/publication PID without the namespace (e.g. utblac, utlarch).
  - HOSTPUBLICATION
  - HOSTBOOK
  - HOSTISSUE
- When you submit the batch request, the PID of the sub-collection is ignored.
Sample manifest (datastreams.txt) file when ingesting page(s) and/or issue(s):
Panel |
---|
HOSTPUBLICATION==e0026f7d-9a79-4a8d-8a83-efc153c6a449 |
PAGE001==firstpagefilename.tiff |
PAGE002==secondpagefilename.tiff |
PAGE001_MODS==firstpagefilename.xml |
PAGE002_MODS==secondpagefilename.xml |
PAGE001_OCR_CUSTOM==firstpageocrfilename.txt (note: must be text file to get properly indexed) |
PAGE002_OCR_CUSTOM==secondpageocrfilename.txt |
FULL_TEXT_CUSTOM==yourcustomocr.txt |
PDF==pdffilename.pdf (optional) |
PAGE001_ALTO==altooutput.xml (optional) |
- When adding pages, the directory containing the pages to add should have the suffix '_PAGES'.
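The steps above can be sketched in Python as a small helper that writes the manifest text for pages being added to an existing object, plus a check on the directory name. The function names are hypothetical, and so are two assumptions: three-digit zero padding (taken from the sample keys `PAGE001`, `PAGE002`) and that a namespace, if present, is separated from the PID by a colon.

```python
from pathlib import Path

HOST_KEYS = {"HOSTPUBLICATION", "HOSTISSUE", "HOSTBOOK"}


def build_page_manifest(host_key, host_pid, page_files, first_page=1):
    """Return manifest text that adds page_files to an existing host object."""
    if host_key not in HOST_KEYS:
        raise ValueError(f"unknown host argument: {host_key!r}")
    # Assumption: a namespace prefix would be separated from the PID by ':'.
    if ":" in host_pid:
        raise ValueError("give the host PID without the namespace")
    lines = [f"{host_key}=={host_pid}"]
    for offset, name in enumerate(page_files):
        # Three-digit padding follows the sample manifest keys.
        lines.append(f"PAGE{first_page + offset:03d}=={name}")
    return "\n".join(lines) + "\n"


def is_pages_directory(path):
    """True if the directory name carries the '_PAGES' suffix."""
    return Path(path).name.endswith("_PAGES")


print(build_page_manifest(
    "HOSTBOOK",
    "e0026f7d-9a79-4a8d-8a83-efc153c6a449",
    ["page12.tiff", "page13.tiff"],
    first_page=12,
))
print(is_pages_directory("batch/book1_PAGES"))
```

Setting `first_page` explicitly makes it easier to start numbering after the last page already in the book.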