Convert tabular metadata to XML

This page explains how to use OpenRefine and xml_split to create DAMS MODS XML from a spreadsheet. The workflow is adapted from the blog post "Converting Spreadsheets into MODSXML using Open Refine" by The Digital Scholarship Unit (DSU) at the UTSC Library. Credit is also due to Benn Chang for adapting the workflow and developing the script for renaming files.


 Click here to show the Table of Contents...

Required software

Installation might require help from LITS to install on your computer.

  • A good text editor for XML editing - in principle, even Windows notepad should do the job, but for greater comfort we recommend Notepad++ (https://notepad-plus-plus.org/) with the XML Tools plugin. For the Mac, there seem to be fewer free or Open Source tools available, one would be Eclipse (https://www.eclipse.org). If you plan on working a lot with XML files, you might want to look into asking for a license for Oxygen XML (https://www.oxygenxml.com/, available for both Mac and Windows)
  • OpenRefine 2.8 or higher (http://openrefine.org/)
  • A Unix/Linux shell that allows you to run bash shell scripts and Perl commands - e.g. your Mac's Terminal or the Windows Subsystem for Linux (WSL) with e.g. Ubuntu
  • xml_split command-line tool from XML::Twig

     Click here for installation instructions for macOS
    1. Download stable release of XML::Twig from https://www.xmltwig.org/xmltwig/
    2. Extract the archive
    3. In System Preferences, got to Users & Groups and elevate logged in user from standard to admin. Standard users with an appadmin account can unlock preferences in order to elevate their standard user account.
    4. In Terminal, change directory to the unpacked archive
    5. Paste the Readme file's Installation commands into Terminal
    6. Check for xml_split command in Terminal
     Click here for installation instructions for Ubuntu
    1. Open the Ubuntu command line.
    2. Update software package information. Type admin password if prompted for it.
      sudo apt update
    3. Install the package containing XML Twig
      sudo apt get xml-twig-tools
      sudo apt install xml-twig-tools
  • xpath command-line tool, e.g. from XML::XPath (should be pre-packaged with macOS)

     Click here for installation instructions for Ubuntu
    1. Open the Ubuntu command line.
    2. Update software package information. Type admin password if prompted for it.
      sudo apt update
    3. Install the package containing XML Twig
      sudo apt get libxml-xpath-perl
      sudo apt install libxml-xpath-perl

What you do

Use OpenRefine to transform tabular data to XML

If you want to skip this step for practice purposes, you can download an exported XML document collection here.

  1. Start OpenRefine. The interface will open in a web browser. If the browser window with the OpenRefine interface does not appear automatically, open a browser window and navigate to the URL http://127.0.0.1:3333.
  2. Locate your metadata spreadsheet (screenshot) and click "Next".
  3. Preview the data and configure parsing options, e. g. the correct character encoding.
  4. Click "Create Project" to load your metadata into a new OpenRefine project.
  5. Optional: Use OpenRefine for data cleanup, e.g. normalization of names, removal of leading or trailing whitespace.
  6. Click on "Export" to open the export menu.
  7. Select the menu option "Templating...".
    An overlay dialog titled "Templating Export" will appear. The overlay is split into a left-hand templating side and a right-hand preview side (screenshot).
  8. Replace the values in the text field "Prefix" with the following code:

    <modsCollection xmlns="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd">
  9. Replace the values in the text field "Row template" with the content of your MODS XML template file.
  10. Empty all the values in the text field "Row Separator".
  11. Replace the values in the text field "Suffix" with the following code:

    </modsCollection>
  12. The preview section on the right half of the dialog overlay updates automatically every time you change values in the text fields on the left half.
  13. Check the preview section and verify that the values from your metadata spreadsheet have been filled into correct position. The Templating Export will create a single XML file containing a collection of XML documents for each asset in your spreadsheet. Note that during the export OpenRefine does not necessarily preserve the order of rows as it was present in your metadata spreadsheet.
  14. Click "Export" and save the resulting text file to your computer.
  15. Rename the exported file to replace the file extension .txt with .xml.

Split the exported XML document collection

  1. Open a Terminal/command line window.
  2. Run xml_split on the XML document collection exported from OpenRefine:

    xml_split /path/to/your/openrefine-export-file.xml

    Tip 1: On the Mac you can drag and drop your exported file into the terminal to avoid having to type its full path.
    Tip 2: If you use the Windows Subsystem for Linux, check how the path to your local hard drive is mounted into the Linux subsystem. Typically, the C: drive is mounted as /mnt/c/.
    If your terminal returns an error like "not well-formed (invalid token) at line 151, column 130, byte 8952 at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/XML/Parser.pm line 187.", it might be due to an illegal character. In our example, an ampersand (&) is used inside an XML element value. The ampersands is a reserved character in XML and must be escaped for use as XML element value, e.g. inside a Title. To print a literal ampersand, replace the & character with the string &amp; (https://www.w3.org/TR/REC-xml/#syntax). Example screenshot

  3. The program xml_split will create individual XML document files from the exported file, one for each asset. The individual files will be output to the same directory as the input file. They will have the same name as the input file, with a number suffix (screenshot). Note that the generic filenames of the split XML files will in many cases not be suitable for batch ingest and might have to be renamed.

Rename XML files

The XML files created by splitting the OpenRefine export document have a generic filename that makes them unsuitable for batch ingest with e. g. image files. The batch ingest process requires data files and metadata XML files to have matching names. This section explains how to rename XML files with a script, so that the file names follow the template txu-oclc-{OCLC number}_METADATA.xml, which matches the convention commonly used for material digitized by UT Libraries' Digitization Services. If your data files to be ingested are named according to a different convention and you want help with renaming the XML files to match, please consult with the DAMS managers (click here to submit a DAMS service request) and/or Digitization services.

  1. Download the rename_xml.sh script file and save it to your computer.
  2. Open a Terminal/command line window.
  3. Navigate to the directory containing your set of XML files that should be renamed.
  4. Create a new directory where the renamed files should be copied to, e. g. "renamed".
  5. Check that your XML files contain an identifier element with the type 'oclcSource'.
  6. Edit the following command to match your file paths:

    find . -maxdepth 1 -type f -name "*.xml" -exec path/to/rename/script/rename_xml.sh {} output-path/ \;

    "output-path/" is the directory path where the renamed files should be copied to. The output path name must end with a forward slash.

  7. Run the adapted find command to find all XML files in the current directory and apply the rename_xml.sh script to them.
    The script will copy each XML file in the current folder to the output directory and rename it according to the template txu-oclc-{OCLC number}_METADATA.xml.