Managing Disk Space in UNIX

Intro

In UNIX, files are stored in directories. Directories may also contain other directories, called subdirectories. Every file and directory has a place in the UNIX filesystem, which is organized as a tree. The top of the tree is called root and is written /. Every other file or directory is identified by listing the directories between it and /. For example, /usr/bin/bash is a file contained in the directory /usr/bin, which is contained in the directory /usr, which is in turn contained in /. This is called the full path name of the file or directory.

Every directory also contains two entries, . and .., which have special meanings. The entry . is shorthand for the directory that you are currently in, while .. means its parent directory. For example, in /home/someuser/SpecialDirectory/, . refers to /home/someuser/SpecialDirectory, while .. refers to /home/someuser. A path written this way is known as a relative path name. You can also chain these entries together. That is, starting from the same directory,

..          =>  /home/someuser
../..       =>  /home
../../..    =>  /

This is very useful when copying files around, so it's well worth getting into the habit of remembering where you are in the filesystem. If you forget, you can always run the pwd command.
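As a rough illustration of how . and .. resolve, the sketch below builds a throwaway directory tree and prints the full path behind each relative name. The directory names are hypothetical and only mirror the example above; the tree is created under a temporary directory, so the printed paths will carry that prefix.

```shell
# Create a scratch tree (hypothetical names) under a temporary directory.
base=$(mktemp -d)
mkdir -p "$base/home/someuser/SpecialDirectory"
cd "$base/home/someuser/SpecialDirectory"

# pwd prints the full path name of the current directory.
here=$(pwd)

# Each subshell changes directory using a relative path, then reports
# where it ended up. The parent shell's directory is left unchanged.
parent=$(cd .. && pwd)
grandparent=$(cd ../.. && pwd)

echo ".      =>  $here"
echo "..     =>  $parent"
echo "../..  =>  $grandparent"
```

Running the cd inside $( ... ) keeps your own shell where it was, which is handy when you only want to check where a relative path points.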

Ultimately, every file on every computer is stored on a disk. Unlike Windows, which has drives (e.g., C:, D:) that you have to know in order to access a file, UNIX abstracts the disk away: the mapping between a full path name and a drive is made when the file referenced by that name is accessed. You do still need to be aware of this mapping, though, because the disk you are working on can only store so much data. It is also useful to know when a disk is a network drive mounted from another computer.

The df command is used to examine the disk usage of the system. By itself, it shows each independent disk and the location where that disk is grafted onto the filesystem. It is usually more useful to ask about a specific location by passing an argument such as ~ (your home directory), . (the current directory), or some specific directory where you are about to copy a lot of data (say /share/scratch). In this case, df will respond with the usage of the disk that contains that particular file or directory. On a Linux computer you can add the -h option, which causes disk sizes to be listed in human-readable units. Here is an example

-bash-3.2$ df -h /share/apps
Filesystem            Size  Used Avail Use% Mounted on
darwin-172.local:/export/apps-x86
                       48G   23G   26G  47% /share/apps

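showing that /share/apps is on a 48G filesystem mounted over the network from darwin-172.local, with 23G (47%) used and 26G still available.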

Keep in mind when viewing this information that you may not have access to all of the available space. For example, the disk space for home directories is shared among all users on the system, and the disk space for lab directories is shared by all members of the lab. Quotas will likely be used at some point to enforce fair use of some shared resources at CCBB.
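The df queries described above can be sketched as follows. This is only an illustration: it asks about the current directory, but ~ or a path such as /share/scratch could be passed in the same way. The -P and -k options are the POSIX ones; -P keeps each filesystem on a single line, which avoids the wrapped device name seen in the example output, and makes the columns easier to pick apart in a script. The variable name avail_kb is just a placeholder.

```shell
# Human-readable usage of the disk that holds the current directory.
df -h .

# Portable form: -P prints one line per filesystem, -k reports sizes in
# 1024-byte blocks. Line 2 is the data line; column 4 is the available space.
avail_kb=$(df -Pk . | awk 'NR == 2 { print $4 }')
echo "Available in the current directory: ${avail_kb} KB"
```

Checking the available space like this before a large copy is a cheap way to avoid filling a shared disk. The column lookup assumes the filesystem name contains no spaces.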

What is my Disk Usage?

Now that you have a grip on the basic fact that disk space is a limited resource, it is time to learn how to manage it. One important question to answer is, "Where did my space go?" It would be very painful to remove only a few small files while leaving the bulk of your data around. Instead, you want to find the items that take up the most space first.

Finding the Largest Items

When it comes time to clean up, you may want to remove only select files. This can be painful, because you may remove many small files and still make very little headway. In this case, I use

du -ka * | sort -rn | cut -f2 | xargs -d '\n' du -ha | more

to examine all of the items in your current working directory and produce a listing of their sizes and, if they are directories, the sizes of each of the items inside them. The first du just produces a raw number that is used to sort the list from biggest to smallest. Then du is called again to produce human-readable output. The catch is that files and directories are mixed together in the listing, because the size of a directory depends on the sizes of the files inside it. For example, you'll see things like

28K a/b/c/f1
252K a/b/c/f2
3.1M a/b/c/f3
3.0M a/b/c/f4
6.3M a/b/c
2.8M a/b/d/f5
252K a/b/d/f6
2.9M a/b/d/f7
28K a/b/d/f8
5.9M a/b/d
252K a/b/e/f9
2.8M a/b/e/f10
2.9M a/b/e/f11
28K a/b/e/f12
5.9M a/b/e
18M a/b
...

This would be followed by the remaining items in decreasing order of size. These might be in directory a/b, in directory a, or in some other directory named on the original command line. You can see that directory a/b/c consists of 4 files which use a total of 6.3 MB of space. Likewise, a/b/d has 4 files for a total of 5.9 MB, and a/b/e has 4 files for a total of 5.9 MB. Collectively, a/b then holds 18 MB worth of files.
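To see what the first half of the pipeline hands to the second, the sketch below runs the ranking stages on a small throwaway tree. The file names and sizes are made up for illustration, and the exact numbers du reports will vary from one filesystem to another.

```shell
# Build a scratch tree (hypothetical names) with one large and one small file.
work=$(mktemp -d)
cd "$work"
mkdir -p a/b
dd if=/dev/urandom of=a/b/large bs=1024 count=2048 2>/dev/null   # about 2 MB
dd if=/dev/urandom of=a/b/small bs=1024 count=16 2>/dev/null     # about 16 KB

# Stage 1: size in kilobytes for every file and directory, largest first.
# The -- guards against names that begin with a dash.
ranked=$(du -ka -- * | sort -rn)
echo "$ranked"

# Stage 2: drop the size column (du separates the columns with a tab),
# leaving only the names in ranked order.
names=$(printf '%s\n' "$ranked" | cut -f2)
echo "$names"
```

The list of names is what xargs passes to the second du, which prints the same items again with human-readable sizes.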

Having the largest items at the front is likely what you want. You can also remove the -a from each du command. In this case, du will show only the items that you ask for and any subdirectories. For example, the above collapses to

6.3M a/b/c
5.9M a/b/d
5.9M a/b/e
18M a/b

This tells you which directories to examine. Finally, you can use -s instead of -a to get a summary, which shows the size of only the requested items. In this case, I'd just get

36G a

because in total the directory a contains 36GB.
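The two shorter variants can be compared side by side on a small throwaway tree. This is a minimal sketch with hypothetical directory names. It uses -k and a numeric sort rather than -h so that the ordering is reliable, and the sizes shown will depend on the filesystem.

```shell
# Build a scratch tree (hypothetical names): one big directory, one tiny one.
work=$(mktemp -d)
cd "$work"
mkdir -p big/sub tiny
dd if=/dev/urandom of=big/sub/data bs=1024 count=2048 2>/dev/null   # about 2 MB
dd if=/dev/urandom of=tiny/note bs=1024 count=4 2>/dev/null         # about 4 KB

# Without -a: one line per directory, subdirectories included, largest first.
dirs_only=$(du -k -- * | sort -rn)
echo "$dirs_only"

# With -s: one line per requested item only, largest first.
summary=$(du -sk -- * | sort -rn)
echo "$summary"
```

Starting with the -s form and then descending into the largest directory is a quick way to narrow down where the space went.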