On Monday, June 13, I had the opportunity to attend a work-in-progress workshop taught by Mark Phillips (UNT's Asst. Dean for Digital Libraries), which gave a basic introduction to Unix-based command-line tools for digital preservation.

The workshop began with a very basic intro to the command-line environment and then briefly covered some things you can do with it. For a more thorough review of what went on at the workshop, I've attached a text file with my somewhat decipherable notes, a print-out of my Terminal session during the workshop, and Mark's ultra-minimal PowerPoint.

Here I'll limit things to a few general points:

  • Some digital preservation tasks that can be performed with the command-line:
    • File characterization, i.e. identifying a file's format and other features that are difficult to view and extract in a GUI. Mark pointed out that file extensions are arbitrary and really tell us nothing about what a file is. (See the "file" sketch after this list.)
    • Checksum generation (a checksum sketch also follows below)
    • Metadata quality control, usually by generating sorted lists of metadata field values, which makes it much easier to spot errors and typos (see the sorted-list sketch below)
    • Creating packages of items for transmission, which can be accomplished with BagIt, for instance (see the BagIt sketch below)
    • I'm sure there are many others, but this seems like a solid foundation, and it really got me interested in finding out more
  • A great introduction to working in the command-line environment ("the shell" in Unix terminology): http://linuxcommand.org/learning_the_shell.php
  • The concept of piping within the command-line was a revelation to me.  Basically, piping allows you to take the output from one command and feed it into another command without saving intermediary files.  There's really no limit to how many commands and tools can be piped (or you can think of them as being "chained") together.  Here's an example, where Mark showed how to determine the number of pages in a PDF file:
     pdfinfo UNT-open-access-symposium-2011.pdf | grep Pages
    
    • The output of this command is:
      Pages:     48
      
      This result is arrived at by taking the output of the "pdfinfo" command and piping it into the "grep" command.  "grep" searches a block of text and pulls out the lines matching a given string (in this case, the line containing the "Pages" field).  It's not hard to see the possibilities from here: I immediately started thinking of extracting technical metadata from a large number of image files, such as pixel dimensions and byte sizes (see the last sketch after this list).
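
To follow up on a few of the points above, here are some minimal sketches of my own (these weren't shown at the workshop, so treat them as starting points rather than recipes). First, file characterization: the standard Unix "file" command identifies a format by inspecting a file's contents (its "magic bytes") rather than trusting its extension. The filename here is made up, and the exact wording of the output varies from system to system:

     file mystery-item.dat
     mystery-item.dat: PDF document, version 1.4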
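
For checksum generation, the "md5sum" utility (part of GNU coreutils; on a Mac the equivalent command is "md5") can both create a manifest of checksums and later verify files against it:

     md5sum *.pdf > manifest.md5
     md5sum -c manifest.md5

The second command re-reads each file and reports "OK" or "FAILED" for every entry in the manifest, which is a quick way to catch files that have changed or become corrupted.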
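
The sorted-list approach to metadata quality control can be built entirely from piped standard tools. As a sketch, assume a hypothetical tab-delimited export called metadata.tsv whose second column is a subject field:

     cut -f2 metadata.tsv | sort | uniq -c | sort -rn

"cut" extracts the field, "sort" groups identical values together, "uniq -c" counts each distinct value, and the final "sort -rn" lists the most common values first. Typos tend to surface at the bottom of the list as near-duplicates with a count of one.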
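
BagIt is a packaging specification rather than a single tool, but the Library of Congress publishes a Python implementation whose command-line script can turn a directory into a bag in place. The directory name below is hypothetical, and the exact invocation depends on the version you have installed:

     bagit.py /data/photos-to-transfer

The script moves the payload into a data/ subdirectory and writes bagit.txt plus checksum manifests alongside it, so the package carries its own fixity information when it's transmitted.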
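
And as for that image idea: if ImageMagick is installed, its "identify" command can print selected technical metadata for a whole batch of files, and the output pipes into other tools just like the output of "pdfinfo" did. The *.tif pattern and the format string here are just assumptions for illustration (%f is the filename, %w and %h the pixel dimensions, %b the byte size):

     identify -format "%f %wx%h %b\n" *.tif | sort

Piping the result through "sort" gives an ordered list that's easy to scan for outliers.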