Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

This page explains how to use OpenRefine and xml_split to create DAMS MODS XML from a spreadsheet. The workflow is adapted from the blog post "Converting Spreadsheets into MODSXML using Open Refine" by The Digital Scholarship Unit (DSU) at the UTSC Library. Credit is also due to Benn Chang for adapting the workflow and developing the script for renaming files.

Expand
titleClick here to show the Table of Contents...

Table of Contents

Required software

Installation might require help from LITS to install on your computer.

Auibutton
titleCreate LITS service request (opens in new window/tab)
typestandard
urlhttps://utexas.sharepoint.com/sites/UTLibraries/LIT/SitePages/Service%20Request.aspx
targettrue

  • A good text editor for XML editing - in principle, even Windows notepad should do the job, but for greater comfort we recommend Notepad++ (https://notepad-plus-plus.org/) with the XML Tools plugin. For the Mac, there seem to be fewer free or Open Source tools available, one would be Eclipse (https://www.eclipse.org). If you plan on working a lot with XML files, you might want to look into asking for a license for Oxygen XML (https://www.oxygenxml.com/, available for both Mac and Windows)
  • OpenRefine 2.8 or higher (http://openrefine.org/)
  • A Unix/Linux shell that allows you to run bash shell scripts and Perl commands - e.g. your Mac's Terminal or the Windows Subsystem for Linux (WSL) with e.g. Ubuntu
  • xml_split command-line tool from XML::Twig

    Expand
    titleClick here for installation instructions for macOS
    1. Open Terminal on your Mac.
    2. Log in with admin credentials. Type in your admin password when prompted for it.
      login adminusername
    3. Install XML::Twig. Type in your admin password when prompted for it.
      sudo cpan XML::Twig
    4. Press 'y' to install all tools. If you don't have a compiler installed on your system, you'll be prompted to get one. A compiler is required for the installation to be successful. Xcode is recommended, and was used in this documentation. If XML::Twig installation is successful, the last installation line will read "/usr/bin/make install  -- OK."
    5. Type the logout command to return to your user prompt.
      logout
    Expand
    titleClick here for installation instructions for Ubuntu
    1. Open the Ubuntu command line.
    2. Update software package information. Type admin password if prompted for it.
      sudo apt update
    3. Install the package containing XML Twig
      sudo apt get xml-twig-tools
      sudo apt install xml-twig-tools
  • xpath command-line tool, e.g. from XML::XPath

    Expand
    titleClick here for installation instructions for Ubuntu
    1. Open the Ubuntu command line.
    2. Update software package information. Type admin password if prompted for it.
      sudo apt update
    3. Install the package containing XML Twig
      sudo apt get libxml-xpath-perl
      sudo apt install libxml-xpath-perl

What you do

Use OpenRefine to transform tabular data to XML

If you want to skip this step for practice purposes, you can download an exported XML document collection here.

  1. Start OpenRefine. The interface will open in a web browser. If the browser window with the OpenRefine interface does not appear automatically, open a browser window and navigate to the URL http://127.0.0.1:3333.
  2. Locate your metadata spreadsheet (screenshot) and click "Next".
  3. Preview the data and configure parsing options, e. g. the correct character encoding.
  4. Click "Create Project" to load your metadata into a new OpenRefine project.
  5. Optional: Use OpenRefine for data cleanup, e.g. normalization of names, removal of leading or trailing whitespace.
  6. Click on "Export" to open the export menu.
  7. Select the menu option "Templating...".
    An overlay dialog titled "Templating Export" will appear. The overlay is split into a left-hand templating side and a right-hand preview side (screenshot).
  8. Replace the values in the text field "Prefix" with the following code:

    Code Block
    languagexml
    <modsCollection xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
  9. Replace the values in the text field "Row template" with the content of your MODS XML template file.
  10. Empty all the values in the text field "Row Separator".
  11. Replace the values in the text field "Suffix" with the following code:

    Code Block
    languagexml
    </modsCollection>
  12. The preview section on the right half of the dialog overlay updates automatically every time you change values in the text fields on the left half.
  13. Check the preview section and verify that the values from your metadata spreadsheet have been filled into correct position. The Templating Export will create a single XML file containing a collection of XML documents for each asset in your spreadsheet. Note that during the export OpenRefine does not necessarily preserve the order of rows as it was present in your metadata spreadsheet.
  14. Click "Export" and save the resulting text file to your computer.
  15. Rename the exported file to replace the file extension .txt with .xml.

Split the exported XML document collection

  1. Open a Terminal/command line window.
  2. Run xml_split on the XML document collection exported from OpenRefine:

    Code Block
    xml_split /path/to/your/openrefine-export-file.xml

    Tip 1: On the Mac you can drag and drop your exported file into the terminal to avoid having to type its full path.
    Tip 2: If you use the Windows Subsystem for Linux, check how the path to your local hard drive is mounted into the Linux subsystem. Typically, the C: drive is mounted as /mnt/c/.
    If your terminal returns an error like "not well-formed (invalid token) at line 151, column 130, byte 8952 at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/XML/Parser.pm line 187.", it might be due to an illegal character. In our example, an ampersand (&) is used inside an XML element value. The ampersands is a reserved character in XML and must be escaped for use as XML element value, e.g. inside a Title. To print a literal ampersand, replace the & character with the string &amp; (https://www.w3.org/TR/REC-xml/#syntax). Example screenshot

  3. The program xml_split will create individual XML document files from the exported file, one for each asset. The individual files will be output to the same directory as the input file. They will have the same name as the input file, with a number suffix (screenshot). Note that the generic filenames of the split XML files will in many cases not be suitable for batch ingest and might have to be renamed.

Rename XML files

The XML files created by splitting the OpenRefine export document have a generic filename that makes them unsuitable for batch ingest with e. g. image files. The batch ingest process requires data files and metadata XML files to have matching names. This section explains how to rename XML files with a script, so that the file names follow the template txu-oclc-{OCLC number}_METADATA.xml, which matches the convention commonly used for material digitized by UT Libraries' Digitization Services. If your data files to be ingested are named according to a different convention and you want help with renaming the XML files to match, please consult with the DAMS managers (utldamsmanagers@utlists.utexas.edu) and/or Digitization services.

...