Mass Editing Finding Aids

This guide serves as an introduction to using regular expressions for mass editing of metadata, in this case applied to XML finding aids. This workflow can be useful for making updates to fit TARO Best Practice guidelines or to remediate harmful language across multiple finding aids. It can be adapted for use in any XML editor (Notepad++, Oxygen, etc.). The examples below use Notepad++ and Oxygen.

For more information about regular expressions (regex), see the tools listed below.

Tools




Workflow


Determine what set of finding aids you want to edit. These finding aids should be in the same directory level. Making edits across a directory allows for easier quality control, as edited values and/or errors should consistently appear across items within it. 

Screenshot of a folder of finding aids
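For readers who want to check a folder outside an editor, the idea of working on a single directory level can be sketched in Python. This is only an illustration: the function name list_finding_aids is hypothetical, and it assumes the finding aids use the .xml extension.

```python
from pathlib import Path


def list_finding_aids(folder):
    """Return the .xml files directly inside folder, sorted by name.

    Path.glob("*.xml") does not descend into subfolders, which mirrors
    the advice to keep the finding aids on one directory level.
    """
    return sorted(p for p in Path(folder).glob("*.xml") if p.is_file())
```

Calling list_finding_aids on the folder chosen for editing gives a quick count of the files that a "Find in Files" action would touch.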

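Which regular expressions to use depends on what edits need to be made to the finding aid. The examples below detail three scenarios: creating a new section in a finding aid, adding content to an existing section, and editing content within a section.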


 Creating a new section in a finding aid

This example scenario adds a new controlled access section to a set of finding aids.

Open the Find window by either pressing Ctrl + F or using the "Find" tab in the top menu ("Search → Find → Find in Files" in Notepad++ or "Find in Files" in Oxygen). Choose the folder of files to edit in the "Directory" dialog box. This view can be seen in the screenshots below.

Screenshot of Notepad++ with directory filepath of files to be edited written in.


Screenshot of Oxygen with directory filepath of files to be edited written in.


Next, determine where to insert the controlled access section. In this example, some finding aids have a related material section and some do not. Two find and replace actions will be needed in order to place the controlled access section after related material or user restrictions. 

Before implementing any find + replace actions, the following options should be chosen in the "Find in Files" menu.

For Notepad++, these are located in the "Search Mode" box. Choosing "Regular expression" allows regex syntax, including the \n and \t sequences used below, to be interpreted. "Wrap around" makes a search that reaches the end of a document continue from the top, and ". matches newline" lets the . wildcard also match line-break characters.

Screenshot of Notepad++ Search Mode box.

In Oxygen, the "Regular expression" box should be checked so regex syntax can be interpreted.

Screenshot of Oxygen's search options.




First, use the "Find All" button in the "Find in Files" window to see all instances of the ending </relatedmaterial> or </userestrict> tag. More than one value can be searched for by using | between each value, as in </relatedmaterial>|</userestrict>.


Using "Find All" returns a list of results as seen in the screenshot below, indicating what line of the finding aid the content is on. 
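The same search can be sketched in Python to show how the | alternation behaves. This is a minimal illustration, not part of the editor workflow: the function name find_closing_tags and the sample fragment are made up for the example.

```python
import re

# Same alternation as typed in the search box: either closing tag matches.
CLOSING_TAGS = re.compile(r"</relatedmaterial>|</userestrict>")


def find_closing_tags(text):
    """Return (line_number, matched_tag) pairs, counting lines from 1."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in CLOSING_TAGS.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Hypothetical finding aid fragment, used only for illustration.
sample = (
    "<archdesc>\n"
    "\t<userestrict>\n"
    "\t\t<p>Restrictions text.</p>\n"
    "\t</userestrict>\n"
    "\t<relatedmaterial>\n"
    "\t\t<p>Related collections.</p>\n"
    "\t</relatedmaterial>\n"
    "</archdesc>\n"
)

print(find_closing_tags(sample))
# prints [(4, '</userestrict>'), (7, '</relatedmaterial>')]
```

Like the "Find All" results list, the output pairs each match with the line it was found on.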

Screenshot of Notepad++ with all instances of </userestrict> or </relatedmaterial>.


Screenshot of Oxygen with all instances of </userestrict> or </relatedmaterial>.


There are two finding aids that have only a user restrictions section and two that have a related material section. The find + replace action will need to be run once for each of these variations. This can be accomplished by separating each set into a different folder, then using "Replace in Files" in the "Find in Files" menu to make changes to all finding aids in the folder (ensure the correct folder is selected in the "Directory" drop-down box).
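The sorting step can be pictured as a small Python sketch. It assumes, as in this example, that the new section should follow related material when that section exists and user restrictions otherwise. The function names and file names are hypothetical.

```python
def choose_anchor(text):
    """Pick the closing tag that the new section should follow."""
    if "</relatedmaterial>" in text:
        return "</relatedmaterial>"
    if "</userestrict>" in text:
        return "</userestrict>"
    return None  # neither section is present; handle these files by hand


def group_by_anchor(texts):
    """Map each anchor tag to the list of file names that should use it."""
    groups = {}
    for name, text in texts.items():
        groups.setdefault(choose_anchor(text), []).append(name)
    return groups


# Hypothetical file names and contents, for illustration only.
examples = {
    "aid_one.xml": "<userestrict><p>Text.</p></userestrict>",
    "aid_two.xml": (
        "<userestrict><p>Text.</p></userestrict>"
        "<relatedmaterial><p>Text.</p></relatedmaterial>"
    ),
}

print(group_by_anchor(examples))
```

Each group corresponds to one folder in the workflow above, and therefore to one run of the find + replace action.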


Next, enter in the "Replace with" dialog box the <controlaccess> section and its related text and elements. In the code block below, the </relatedmaterial> tag is placed first, to ensure the controlled access section appears after it. Then follows the related <head> tag indicating the specific kind of subject terms being added; in this example, it is geographic terms. The \n (newline) and \t (tab) sequences recreate the hierarchical structure needed in a finding aid. Getting the exact formatting may take some trial and error.


Notice that a \ is placed before each parenthesis in the text "Austin (Tex.)". Parentheses have a special meaning in regex syntax, so they need to be escaped to be treated as literal characters.

</relatedmaterial>\n\t\t<controlaccess>\n\t\t\t<head>Index Terms</head>\n\t\t\t<controlaccess>\n\t\t\t\t<head>Places</head>\n\t\t\t\t<geogname source="lcsh" encodinganalog="651">Austin \(Tex.\)</geogname>\n\t\t\t</controlaccess>\n\t\t</controlaccess>

Repeat this process for finding aids that have only a </userestrict> section.

</userestrict>\n\t\t<controlaccess>\n\t\t\t<head>Index Terms</head>\n\t\t\t<controlaccess>\n\t\t\t\t<head>Places</head>\n\t\t\t\t<geogname source="lcsh" encodinganalog="651">Austin \(Tex.\)</geogname>\n\t\t\t</controlaccess>\n\t\t</controlaccess>
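For comparison, the same insertion can be sketched in Python. This is an illustration under stated assumptions rather than the editors' own mechanism: the function name add_controlaccess is hypothetical, and Python's re module has its own replacement rules, so the sketch passes a function as the replacement and the parentheses need no backslash here.

```python
import re

# The new section, written with real newline and tab characters.
NEW_SECTION = (
    "\n\t\t<controlaccess>"
    "\n\t\t\t<head>Index Terms</head>"
    "\n\t\t\t<controlaccess>"
    "\n\t\t\t\t<head>Places</head>"
    '\n\t\t\t\t<geogname source="lcsh" encodinganalog="651">Austin (Tex.)</geogname>'
    "\n\t\t\t</controlaccess>"
    "\n\t\t</controlaccess>"
)


def add_controlaccess(text, closing_tag):
    """Insert NEW_SECTION right after the first occurrence of closing_tag."""
    pattern = re.escape(closing_tag)  # treat the tag as literal text
    # A function replacement is inserted verbatim, with no escape processing.
    return re.sub(pattern, lambda m: m.group(0) + NEW_SECTION, text, count=1)


# Hypothetical fragment, for illustration only.
sample = "\t\t<relatedmaterial>\n\t\t\t<p>Related collections.</p>\n\t\t</relatedmaterial>\n"
print(add_controlaccess(sample, "</relatedmaterial>"))
```

Passing "</userestrict>" as the second argument covers the finding aids that have only a user restrictions section.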

In Oxygen, you can preview the results of the regular expression before running it on the folder of finding aids. Here, we can see that the changes are incorrectly being applied to the </userestrict> tag, not the </relatedmaterial> tag.

The following screenshot shows the results of the find + replace regex action.

 Adding content to an existing finding aid section

This example scenario adds content statements to a set of finding aids that already have a processing information section.

Open the Find window by either pressing Ctrl + F or using the "Find" tab in the top menu ("Search → Find → Find in Files" in Notepad++ or "Find in Files" in Oxygen). Select the "Find in Files" tab and choose the folder of files to edit in the "Directory" dialog box. This view can be seen in the screenshots below.

Screenshot of Notepad++ with directory filepath of files to be edited written in.


Screenshot of Oxygen with directory filepath of files to be edited written in.


Next, determine where to insert the content statement text. Following UT Libraries guidelines, this should be entered in the processing information section. In these finding aid examples, there is already a statement in the processing information section, indicating who created the finding aid. The content statement will need to be placed after it, as well as being properly indented underneath the <processinfo> and <head> tags.

By finding and replacing the ending </processinfo> tag, the content statement can be consistently placed after other existing processing information content.

Before implementing any find + replace actions, the following options should be chosen in the "Find in Files" menu.

For Notepad++, these are located in the "Search Mode" box. Choosing "Regular expression" allows regex syntax, including the \n and \t sequences used below, to be interpreted. "Wrap around" makes a search that reaches the end of a document continue from the top, and ". matches newline" lets the . wildcard also match line-break characters.

Screenshot of Notepad++ Search Mode box.


In Oxygen, the "Regular expression" box should be checked so regex syntax can be interpreted.

Screenshot of Oxygen's search options.




First, use the "Find All" button in the "Find in Files" window to see all instances of the ending </processinfo> tag. Using "Find All" returns a list of results as seen in the screenshot below, indicating what line of the finding aid the content is on. This action shows how many finding aids have the section (for any that do not, another strategy will need to be used) and whether the tag is repeated elsewhere in other sections. In this case, the tag is unique enough to ensure the content statement will be added only in that section.
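The check described above, one closing tag per finding aid, can be sketched in Python as follows. The function names and the sample file names are hypothetical and exist only for this illustration.

```python
def count_closing_tag(texts, tag="</processinfo>"):
    """Count how many times tag appears in each finding aid."""
    return {name: text.count(tag) for name, text in texts.items()}


def needs_review(counts):
    """Names of files where the tag is missing or appears more than once."""
    return sorted(name for name, n in counts.items() if n != 1)


# Hypothetical file names and contents, for illustration only.
examples = {
    "aid_one.xml": "<processinfo><p>Created by staff.</p></processinfo>",
    "aid_two.xml": "<scopecontent><p>No processing note.</p></scopecontent>",
}

counts = count_closing_tag(examples)
print(counts)                # {'aid_one.xml': 1, 'aid_two.xml': 0}
print(needs_review(counts))  # ['aid_two.xml']
```

Files flagged by needs_review are the ones where a plain find + replace on the closing tag would either do nothing or insert the statement more than once.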

Screenshot of Notepad++ with all instances of </processinfo>.


Screenshot of Oxygen with all instances of </processinfo>.


Next, enter in the "Replace with" dialog box the content statement text and related <p> tags (paragraph). \t means "tab" and will create a space/indent. \n means "newline" and will put the </processinfo> tag on the following line. The following statement will indent the content statement twice, then create a new line, and finally indent twice again to put the </processinfo> tag underneath the content statement.

Screenshot of Notepad++ with the "Find in Files" window open.


Screenshot of Oxygen with the "Find in Files" window open.
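A Python sketch of this replacement is given here for illustration. It differs from the editor approach in one respect: instead of typing a fixed number of \t sequences, it captures the indentation already in front of </processinfo> and reuses it. The statement text is a placeholder, not the actual content statement, and the function name add_statement is hypothetical.

```python
import re

# Placeholder wording; substitute the real content statement.
STATEMENT = "<p>Hypothetical content statement text.</p>"

# Group 1 captures the tabs or spaces in front of the closing tag.
CLOSING = re.compile(r"^([ \t]*)</processinfo>", re.MULTILINE)


def add_statement(text):
    """Place STATEMENT on its own line just above </processinfo>."""
    def build(match):
        indent = match.group(1)
        # The statement sits one level deeper than the closing tag.
        return indent + "\t" + STATEMENT + "\n" + indent + "</processinfo>"
    return CLOSING.sub(build, text)


# Hypothetical fragment, for illustration only.
sample = (
    "\t\t<processinfo>\n"
    "\t\t\t<head>Processing Information</head>\n"
    "\t\t\t<p>Finding aid created by staff.</p>\n"
    "\t\t</processinfo>\n"
)
print(add_statement(sample))
```

Reusing the captured indentation avoids some of the trial and error involved in counting tabs by hand, at the cost of a slightly longer pattern.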


In Oxygen, you can preview the results of the regular expression before running it on the folder of finding aids. Here we can see that the content statement is being added correctly after the processing statement.


Select "Replace in Files" in the "Find in Files" menu to make this change to all finding aids in the folder. The following screenshot shows the results of the find + replace regex action.

 Editing content within a finding aid section

This example scenario edits a set of existing subject terms that have harmful language.

Open the Find window by either pressing Ctrl + F or using the "Find" tab in the top menu ("Search → Find → Find in Files" in Notepad++ or "Find in Files" in Oxygen). Select the "Find in Files" tab and choose the folder of files to edit in the "Directory" dialog box. This view can be seen in the screenshots below.

Screenshot of Notepad++ with directory filepath of files to be edited written in.


Screenshot of Oxygen with directory filepath of files to be edited written in.


Next, examine where the subject terms you want to edit are located in the finding aids. Depending on the language, the terms to edit may appear in other parts of the finding aid. Using element tags can help to avoid unwanted edits.

It can be helpful to leave "Match case" unchecked in order to catch any terms that use varied capitalization.

In this case, the only instance of the term is in the subject terms section. If the terms existed in other parts of the finding aid that should not be edited, adding the element tag can help in confining the changes to the subject section.
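Confining a change to the subject section can be sketched in Python as follows. This is an illustration under assumptions: it treats <subject> as the element holding the terms, the function name replace_in_subjects is hypothetical, and the replacement term shown is only a placeholder to be checked against the vocabulary in use.

```python
import re

# Matches one <subject> element: opening tag, content, closing tag.
SUBJECT = re.compile(r"(<subject\b[^>]*>)(.*?)(</subject>)", re.DOTALL)


def replace_in_subjects(text, old, new):
    """Replace old with new, but only inside <subject> elements."""
    # Whole-word, case-insensitive match, like leaving "Match case" unchecked.
    term = re.compile(r"\b" + re.escape(old) + r"\b", re.IGNORECASE)

    def fix(match):
        open_tag, body, close_tag = match.groups()
        return open_tag + term.sub(lambda m: new, body) + close_tag

    return SUBJECT.sub(fix, text)


# Hypothetical fragment; "Enslaved persons" is a placeholder replacement.
sample = (
    '<subject source="lcsh">Slaves--Texas.</subject>\n'
    "<p>Letters mention slaves on the plantation.</p>\n"
)
print(replace_in_subjects(sample, "Slaves", "Enslaved persons"))
```

The <p> text is left untouched, which is the effect that adding the element tag to the search is meant to achieve in the editors.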

Screenshot of Notepad++ with all instances of the term "Slaves" being used.


Screenshot of Oxygen with all instances of the term "Slaves" being used.




Select "Replace in Files" in the "Find in Files" menu to make this change to all finding aids in the folder.

Screenshot of Notepad++ with the "Find in Files" window open.


Screenshot of Oxygen with the "Find/Replace in Files" window open.


In Oxygen, you can preview the results of the regular expression before running it on the folder of finding aids. Here we can see that the term is being successfully replaced.
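Where no built-in preview is available, a similar before-and-after check can be sketched with Python's standard difflib module. The function name preview_changes and the sample terms are hypothetical, and nothing is written to disk.

```python
import difflib


def preview_changes(before, after, name="finding_aid.xml"):
    """Return a unified diff of the proposed change as one string."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=name + " (before)",
        tofile=name + " (after)",
    )
    return "".join(diff)


# Hypothetical before and after text, for illustration only.
before = '<subject source="lcsh">Old term</subject>\n'
after = '<subject source="lcsh">New term</subject>\n'
print(preview_changes(before, after))
```

Lines starting with - show the original text and lines starting with + show the proposed text. An empty result means the replacement would change nothing in that file.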


The following screenshot shows the results of the find and replace regex action.