...

The find command is a powerful – and of course complex! – way of looking for files in a nested directory hierarchy. The general form I use is:

find <in_directory> [ operators ] -name <expression> [  tests ]

  • looks for files matching <expression> in <in_directory> and its sub-directories
  • <expression> can be a double-quoted string including pathname wildcards (e.g. "[a-g]*.txt")
  • there are tons of operators and tests:
    • -type f (file) and -type d (directory) are useful tests
    • -maxdepth NN is a useful operator to limit the depth of recursion.
  • returns a list of matching pathnames, each starting with <in_directory>, one per output line.

Examples:

Code Block
languagebash
cd
find . -name "*.txt" -type f     # find all .txt files in the Home directory
find . -name "*docs*" -type d    # find all directories with "docs" in the directory name
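
As a minimal sketch of how -maxdepth changes the results, the following builds a small throwaway directory tree and searches it. The temporary directory and all file names are made up for illustration; they are not part of the class data.

```shell
# Build a small throwaway tree (all names here are hypothetical demo names)
demo=$(mktemp -d)
mkdir -p "$demo/docs/old_docs"
touch "$demo/top.txt" "$demo/docs/notes.txt" "$demo/docs/old_docs/deep.txt"

cd "$demo"
# -maxdepth goes before -name, as in the general form above
find . -maxdepth 1 -name "*.txt" -type f    # finds only ./top.txt
find . -maxdepth 2 -name "*.txt" -type f    # also finds ./docs/notes.txt
find . -name "*docs*" -type d               # finds ./docs and ./docs/old_docs

# Count the matches with and without a depth limit
shallow=$(find . -maxdepth 1 -name "*.txt" -type f | wc -l)
all=$(find . -name "*.txt" -type f | wc -l)
echo "depth 1: $shallow, unlimited: $all"
```

Limiting the depth is mostly useful when the directory tree is very large and you already know roughly where the files are.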

...

When dealing with large data files, sometimes scattered in many directories, it is often convenient to create symbolic links (symlinks) to those files in the directory where you plan to work with them. You can then use them in your analysis as if they were local to your analysis directory, without the storage cost of copying them.

Tip
titleAlways symlink large files

Storage is a limited resource, so never copy large data files! Create symbolic links to them in your analysis directory instead.

The ln -s <path_to_link_to> [ link_file_name ] command creates a symbolic link to <path_to_link_to>.

  • ln -s <path> says to create a symbolic link (symlink) to the specified file (or directory) in the current directory
    • always use the -s option to avoid creating a hard link, which behaves quite differently
  • the default link name corresponds to the last name component in <path>
    • you can name the link file differently by supplying an optional link_file_name.
  • it is best to change into (cd) the directory where you want the link before executing ln -s
  • a symbolic link can be deleted without affecting the linked-to file
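
The points above about link naming and deletion can be tried out safely in a scratch area. This is only a sketch: the sandbox directory, the file contents, and the info.txt link name are invented for the demonstration.

```shell
# Create a scratch area with a "data" file and a separate "work" directory
# (the sandbox, its contents, and the info.txt name are hypothetical)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/data" "$sandbox/work"
echo "sample1" > "$sandbox/data/sampleinfo.txt"

# Change into the directory where the links should live
cd "$sandbox/work"
ln -s "$sandbox/data/sampleinfo.txt"            # default link name: sampleinfo.txt
ln -s "$sandbox/data/sampleinfo.txt" info.txt   # explicit link name: info.txt
ls -l                                           # both entries point at the data file

# Reading through a link returns the linked-to file's contents
content=$(cat info.txt)
echo "$content"

# Deleting a link leaves the linked-to file untouched
rm info.txt
ls -l "$sandbox/data"
```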




Code Block
languagebash
mkdir -p ~/syms; cd ~/syms 
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l


mkdir -p ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l

Multiple files can be linked by providing multiple file name arguments along with the -t (target) option, which specifies the directory where links to all the files will be created.

rm -rf ~/test; mkdir ~/test; cd ~/test
ln -s -f -t . /stor/work/CCBB_Workshops_1/bash_scripting/data/*.txt
ls -l

What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in subdirectories:


About compressed files

Because a lot of scientific data is large, it is often stored in a compressed format to conserve storage space. The most common compression program used for individual files is gzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for bundling and compressing directories.
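
A short sketch of these programs in use follows. It works in a temporary directory, and the reads.txt and proj names are made up for illustration.

```shell
# Work in a throwaway directory (file and directory names are hypothetical)
tmp=$(mktemp -d)
cd "$tmp"
printf 'line1\nline2\nline3\n' > reads.txt

gzip reads.txt                   # replaces reads.txt with reads.txt.gz
ls -l

# View the contents without writing an uncompressed copy to disk
nlines=$(gunzip -c reads.txt.gz | wc -l)
echo "lines in compressed file: $nlines"

gunzip reads.txt.gz              # restores reads.txt and removes the .gz file

# Bundle a directory with tar, compressing the archive with gzip (-z)
mkdir proj
cp reads.txt proj/
tar -czf proj.tar.gz proj
tar -tzf proj.tar.gz             # list the archive contents
```

Note that gzip and gunzip replace the original file by default, so only one copy exists at a time.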

...