Moorhead_BoT processing log

**READ ME:** This processing log was created by David Bliss (UT Libraries Digital Stewardship, Systems Administrator) in fall 2021-summer 2022. The wiki page was copied to the Architecture & Planning Library wiki by Stephanie Tiedeken on June 22, 2023. This log will be updated as the project progresses.

Imaging/staging

Disk image (aaa_moorhead_bot.ad1 through aaa_moorhead_bot.ad376) created using FTK Imager by Misha Coleman 2/21/22.

We discovered that the disk image did not include any of the top-level data folders: "Houston Architecture Guide", "SAH BoT 1", "SAH BoT 2", "SAH BoT 2 Fundraising", "SAH BoT 2 Photos", "SAH BoT BIG BOOK", and "Texas Architects". The disk image contained only root-level files, hidden and deleted files, and unallocated space. We believed this was caused by illegal characters at the end of many folder names and file paths within these top-level directories, which hid the folders from FTK Imager altogether. Windows Explorer displays the illegal character as a dot, Autopsy displays it as "Black Touchtone Telephone", and Python recognizes it as Unescape/U+F028. A list of paths ending with the unescape character produced in Python matched a list of paths ending with the Black Touchtone Telephone produced in Autopsy.

Examples of illegal characters in folder names as displayed by Windows Explorer (left) and Autopsy (right).

David wrote a Python script to identify and rename all files and folders containing this character. The script uses os.walk to visit every path under a root directory and the string replace method to remove the character from each name. The script also outputs a CSV report, "whitespace_rename.csv", which lists the original and renamed paths.
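The approach described above might look roughly like the following. This is a minimal sketch, not David's actual script: the function name `strip_illegal_char`, the CSV column headers, and the choice to skip a rename when the cleaned name already exists are all assumptions made for illustration. Only os.walk, the replace method, the U+F028 character, and the report filename come from this log.

```python
import csv
import os

# Private-use code point reported in this log as the illegal character.
ILLEGAL_CHAR = "\uf028"


def strip_illegal_char(root, report_path="whitespace_rename.csv"):
    """Rename files and folders under root whose names contain ILLEGAL_CHAR.

    Returns a list of (original_path, renamed_path) tuples and writes the
    same pairs to a CSV report. Hypothetical helper, for illustration only.
    """
    renamed = []
    # topdown=False visits the contents of a folder before the folder
    # itself, so a parent is renamed only after its children are done.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            if ILLEGAL_CHAR not in name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, name.replace(ILLEGAL_CHAR, ""))
            if os.path.exists(new_path):
                # Assumption: leave the item alone rather than overwrite.
                continue
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))

    with open(report_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["original_path", "renamed_path"])  # assumed headers
        writer.writerows(renamed)
    return renamed
```

Because the walk is bottom-up, paths recorded for nested items still show the parent folder's old name at the moment of renaming. Running the function on a copy of the data first would be a sensible precaution.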

Removing the illegal characters did not allow FTK Imager to view the directories. It is possible that there is an error with the disk, either the original or the copy. Loading the drive in FTK Imager as "Contents of a Folder" did show the files. A second logical image (aaa_moorhead_bot.ad1 through aaa_moorhead_bot.ad259) was created by David on 3/15/22. The second logical image was smaller than the first (259 segment files vs. 376) because FTK Imager does not copy unallocated space or deleted files when files are loaded as "Contents of a Folder".

On 3/21/22 David extracted files from the logical images to begin processing work. It was discovered that the logical images excluded several hundred files found on the disk, possibly as a result of the same error that hid those folders from FTK Imager when the disk was mounted properly. On 3/23/22 David copied the files directly from disk to the dps volume. The following files were initially skipped by TeraCopy, and were renamed to replace the tilde character (~) with an underscore, which allowed them to be copied:

SAH BoT 1\SAH BoT 1 Working\Email\FWTEXA~1.MSG
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\ContributorCorresp\HENRY1~3.TXT
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\ContributorCorresp\JONES2~1.TXT
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\ContributorCorresp\WINTER~2.TXT
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\SAH Corresp\COSPER~2.TXT
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\SAH Corresp\MEMFEE~1.TXT
SAH BoT 1\SAH BoT 1 Working\Proj Mgmt\SAH Corresp\STILLM~1.TXT
SAH BoT 1\SAH BoT 1 Working\SAH\Email\FWTEXA~1.MSG
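The tilde replacement above could be scripted along the following lines. This is only a sketch: the log does not say whether the files were renamed by hand or by script, and the function name `replace_tildes` and the skip-on-collision behavior are assumptions.

```python
import os


def replace_tildes(root):
    """Rename files under root so that '~' in the file name becomes '_'.

    Returns a list of (original_path, renamed_path) tuples.
    Hypothetical helper, for illustration only.
    """
    renamed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if "~" not in name:
                continue
            old_path = os.path.join(dirpath, name)
            new_path = os.path.join(dirpath, name.replace("~", "_"))
            if os.path.exists(new_path):
                # Assumption: never overwrite an existing file.
                continue
            os.rename(old_path, new_path)
            renamed.append((old_path, new_path))
    return renamed
```

Under this sketch, a file such as FWTEXA~1.MSG would become FWTEXA_1.MSG in the same folder.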

The "Houston Architecture Guide" folder, which did not correspond to the Buildings of Texas collection, was separated and moved to the AAA disk4 volume on 4/12/22.

Issues with Solr on the processing workstation delayed the creation of an Autopsy case until ____.

On 5/18/2022, David used a Python script (adapted from Jeremy's earlier script) to extract all .msg files as directories containing the message text and attachments. Each folder was created in the same location as its .msg file and uses the same name, with "_msg" appended. When providing access to the collection, it would be a good idea to delete the .msg files, since any email text or attachments containing PII will only be redacted in the extracted folders. The original .msg files can be written to tape and retrieved if there are any issues with the extraction process. The script also produced a list of .msg files:

On 6/3/2022, David used Jeremy's VBA script to batch convert all .doc files to .docx. Each converted file was created in the same location as the original .doc file and uses the same name, with "_doc" appended to record the original file extension. The Python library used to search and redact Word documents can open .docx files but not .doc files, so the conversion is a necessary step. When providing access to the collection, it would be a good idea to provide access only to the "_doc.docx" copies of the files, since these are the only ones we can be sure will not contain PII. The original .doc files can be written to tape and retrieved if there are any issues with the conversion process.

On 6/6 and 6/7/2022, David double-checked that _doc.docx files had been properly created in all directories. Certain folders whose names still contained illegal characters had been skipped by the VBA conversion script, and in these cases the derivative .docx files were created manually. Some .doc files were Windows 95 binary files that Word was unable to edit (and thereby upconvert to .docx). In these cases, the files were opened, the text was manually copied and pasted into a new document, and the new document was saved as "_doc.docx" in the same location as the original file. The formatting and characters appeared to translate correctly to the new format, but the potential loss of data is a good reason to retain all original files in the preservation copy of the collection.
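A check like the one described above could be partly automated. The sketch below is not the method actually used (the log does not say how the double-check was done); it assumes the naming convention recorded here, where "name.doc" should have a sibling "name_doc.docx", and it compares names case-insensitively on the assumption that the files live on a Windows volume. The function name `find_missing_docx` is hypothetical.

```python
import os


def find_missing_docx(root):
    """Return sorted paths of .doc files that lack a '_doc.docx' sibling.

    Hypothetical helper, for illustration only.
    """
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Lowercased names allow a case-insensitive comparison (assumption).
        present = {name.lower() for name in filenames}
        for name in filenames:
            stem, ext = os.path.splitext(name)
            if ext.lower() != ".doc":
                continue
            expected = stem + "_doc.docx"
            if expected.lower() not in present:
                missing.append(os.path.join(dirpath, name))
    return sorted(missing)
```

Any path returned would point to a .doc file that still needs a derivative, whether through the batch script or the manual copy-and-paste route.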