CCBB File Storage

PLEASE LET US KNOW IF SOMETHING ABOUT THESE NOTES IS NOT CLEAR, OR IF SOMETHING NEEDS TO BE ADDED.

Introduction

CCBB provides centralized storage on the server files.ccbb.utexas.edu. Using this server has many benefits. First, files kept on the server are available on any system that has a network connection, which makes it much easier to share files between computers. The files stored on files.ccbb.utexas.edu are also available on our clusters, making the clusters easier to use. Besides sharing, storing files on the server is a nice disaster recovery feature. If your desktop or laptop computer were to have problems, it could be very time consuming to get your data moved onto a replacement computer. In some cases it might not even be possible, or it might require that we send your computer's hard disk drive to a company offering specialized data recovery services, which can be extremely expensive. In the past we have tried to avoid these disaster recovery problems by attempting to back up all systems. This is a good idea, but it has become a problem. Many people travel without notifying us, and their computers then do not get backed up. A related problem is people turning their computers off at night, or just not being around when the backup server wants to perform a backup. This makes it hard for us to determine whether a missed backup is a problem that needs to be addressed, or a temporary one that can be ignored.

By storing files on the server, your files can be accessed from anywhere at any time. You can drag and drop your files over any time you want. While we may not back them up immediately, at least they are stored on redundant hardware. You can even work on the files directly on the server if you choose to. You can connect to files.ccbb.utexas.edu from anywhere in the world. Another problem with keeping data on individual computers is that their owners have sometimes not been aggressive about reporting computer problems. Backups are not effective if the files are already damaged: all you do is back up a damaged file, if that. By keeping your files on the server, you keep them stored on a system that we can monitor directly. Finally, desktops and laptops are not meant to be high performance computers. Between that and older hardware, we actually spend more time backing up some of these systems than we do backing up our servers, even though we are backing up less total data.

Note: CCBB backups are for disaster recovery only. We typically keep 1-3 months' worth of tapes before reusing them. This means we do have some flexibility in recovering files for a while if you accidentally remove something, or if you need an older version of an edited file restored. Please don't remove data thinking that we will permanently have it available for you. Instead, you must make sure that anything you have removed is safely stored on stable media like CD, DVD, or tape, either because we have done it for you, or because you have done it yourself.

General Notes

Overall, having the server is a good step forward, but it does come with some caveats.

  • DO NOT store Category I data on the server. CAT I data is data which must legally be protected, or which has high integrity requirements. Please see our ISORA page for more information, but because of the strict security requirements needed for CAT I data systems, files.ccbb.utexas.edu is not certified to store it. If you want to store CAT I data the best way to do so is via Webspace, or Austin Disk Services, or both since they don't have the same feature set. Webspace, for example, makes it easy to share files with non-UT people. Austin Disk Services has better features (especially if you are a Windows user), and we think it is easier to use. It does require payment, but we have access to the ITS TRAC system, so we can help you pay for it with an account number.
  • If you want to use the file server from off campus you will either need to use UT's VPN service, or else you will need to use a secure file transfer program (see below). The easiest way to use the VPN is by visiting https://vpn.utexas.edu and logging in. More info on the VPN service can be found on the ITS VPN page.
  • Fundamentally it is your responsibility to store the data you want protected, and backed up. This may mean developing some new habits, and changes to how you are using your computer. You will need to become familiar with the ways that you can connect to the server, and you will need to develop the habits to copy the data over that needs to be saved.
  • Keep in mind the special storage locations that your computer may use, such as the Downloads and Documents folders in Mac OSX, or My Documents on Windows. As you download files, or modify them, you will need to ensure that they end up being copied over to the server. You should also keep in mind that you may have to export data to store it on the server. This is particularly true when you are first using the server with a given computer, when you must take care to check that all of your files are copied over. As an example, bookmarks may need to be exported from browsers if you wish to have them saved. Or email may need to be periodically exported, and saved. For email, a better choice is to sign up for the Austin Exchange Messaging System. See our Email page for more information.

Everyone will have at least one way to store files: their home directory, or folder. This is named after your EID, and is the primary location to store files when you are working on a cluster, or another UNIX based system. If you are doing this, please be sure to read the UNIX login notes below. This is shared space, so the total usage here should be kept within sane limits: a few gigabytes at the most. Members of the Hillis, Bull, Canatella, Chen, and Hofmann Labs have bulk storage space named PILab, where PI is the name of your lab head (eg, HillisLab), which should be the primary place to store files. Yes, this means that sometimes you will have to copy data back and forth, but this is not difficult once you get used to the access methods below. For the actual use of the lab storage space, you should consult your lab head, since each one may have their own idea about how they want you to use it.

Most of all, remember that this is new technology, so please try to give yourself the time to acclimate to using this powerful piece of technology. Also please keep in mind that this is new for your peers too, so if problems develop with someone's usage of the system and that usage affects you, please let us know so that we can deal with it.

If you need help, or need to report a problem with the server, please email us at remark at ccbb.utexas.edu. Note that the address does not say gripe, as many computer centers on campus use, because we believe that pointing out a problem needing to be fixed and complaining are two different things.

Specific Usage Instructions

Windows Style Sharing

Windows file sharing (SMB) can be used from Mac, Windows, and some Linux systems. When you use SMB to connect to the server, you will see a new disk icon appear which you can then click on to open up. This then behaves like any other disk you have on your computer. You can drag and drop files, or click on them to open them up for editing. This is the nicest method of using the server, as you can keep all of your files on it, and work on them as normal.

Windows File Sharing for PCs

To connect to files using a PC, open up the Computer (Vista, Windows 7) or My Computer (XP) icon via the Start Menu. Once it is open, Vista and Windows 7 users will see a "Map Network Drive" option which you can click on. Windows XP users should go into the Tools menu and select "Map Network Drive". Either way, the Map Network Drive window will appear on your screen. The only information you need to provide is the Folder. The Drive letter can be left as is; it is only used if you use the Command prompt. The right folder name is

\\files.ccbb.utexas.edu\EID

for your home folder (replace EID with your username), or

\\files.ccbb.utexas.edu\PILab

to access your lab's shared storage space. Click "Finish" and then you will be prompted for your login information. You should use "files\EID" (replace EID with your actual username), and provide your password. Once connected a new disk will appear in your Computer or My Computer window. You can open this up, and work on your files however you wish. When done, close any open files, right click on the disk, and choose "Disconnect" to disconnect from the server.

You should additionally read the section on Secure File Copy below, where we talk about problems with ASCII vs Binary files. Since Windows file sharing stores files as is, if you are editing text files that will be used on the cluster you should use one of the listed editors to ensure that the files you create are formatted properly.

Windows File Sharing for Mac

To connect to files using your Mac, start by clicking in the main desktop window. This changes the menu bar to that of the Finder. Next go into the "Go" menu, and select "Connect to Server...". The right address to use is

smb://files.ccbb.utexas.edu/EID

to access your home folder on the server (replace EID with your actual user name), or

smb://files.ccbb.utexas.edu/PILab

if you want to access your lab's shared folder. If you wish, you can also make this a Favorite Server which means you won't have to keep entering the information. The username to be given to the server is "files\EID" where again EID is your actual user name. You might also get asked to save your password in the account's keychain. Please be sure not to do this when using someone else's computer, or account!

Once connected, you may see a disk appear on your computer. Once that happens you can open it up, and begin working on your files. Once done, close any open files, and drag the disk to the Trash to disconnect from the server. On some Macs, you might have to open a Finder window to see the disk. This is a Finder property that you can change, or you can leave it like this. To disconnect from the server in this case, you click on the eject icon which is next to the server name.

You should additionally read the section on Secure File Copy below, where we talk about problems with ASCII vs Binary files. Since Windows file sharing stores files as is, if you are editing text files that will be used on the cluster you should use one of the listed editors to ensure that the files you create are formatted properly.

Windows File Sharing for Linux

This will be documented later, as it seems likely that this will not be used by a lot of people. But just in case: KDE users can use Konqueror to do this, and Gnome's GnomeVFS also supports smb:// URLs. You can also use smbclient in a Terminal window.
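Until that documentation exists, here is a rough sketch of what an smbclient call might look like. The function name smb_list is invented for this example, and the "files\EID" login form is simply carried over from the Windows and Mac notes above; we have not confirmed that smbclient needs the "files\" prefix on this server.

```shell
# Hypothetical helper: list the top level of a share with smbclient.
# Usage: smb_list EID SHARE
#   SHARE is your EID for the home folder, or PILab for the lab folder.
smb_list() {
    # Assumption: the "files\EID" login form from the Windows and Mac
    # instructions also applies to smbclient.
    # -U gives the user name, -c runs the quoted smbclient command and exits.
    smbclient "//files.ccbb.utexas.edu/$2" -U "files\\$1" -c 'ls'
}
```

Leaving off the -c option should instead start an interactive smbclient session, where commands such as ls, get, and put are available.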

Secure File Copy

Secure File Copy (SFTP) can be used to access your files from any system. However, it is not as nice as Windows file sharing. To use SFTP, you will need to find an SFTP client for your computer. Some popular ones for Mac are Fugu and Cyberduck. Mac also provides a command, sftp, which can be used in a Terminal window. For Windows, WinSCP is a good choice. As with SMB file sharing above, Linux users can use Konqueror or GnomeVFS, or they can use the sftp command in a Terminal window.

These programs all use the Secure Shell protocol, which requires that you accept a unique host key for each host that you log in to. Here is an example of what you might see when using the sftp command in a Mac or Linux Terminal window:

$ sftp files.ccbb.utexas.edu
Connecting to files.ccbb.utexas.edu...
The authenticity of host 'files.ccbb.utexas.edu (129.116.79.132)' can't be established.
RSA key fingerprint is 1a:86:e8:12:a8:65:0f:2d:eb:b7:eb:ca:40:f9:e7:d5.
Are you sure you want to continue connecting (yes/no)?

This is the right key for the server, so you can go ahead and confirm it. This is only done the first time; subsequent uses of these programs will recognize the key.

Once connected, your experience will vary. If you use one of the command line tools you'll need to use various commands like 'cd', 'put', and 'get' to transfer files. If you are using a Linux desktop, and use Konqueror or GnomeVFS, then you'll have a window full of folders that you can access. For the PC and Mac GUIs you'll get a window with two panes. One will list the files on your computer, and one will list the files on the server. Here again you'll need to know where your files are located, because in each pane you'll need to use a drop down box to locate the folders that have the files you need. Note that when you first log in you are in your home folder (/home/EID), so if you need to put things in your lab folder, you'll need to use the remote host drop box to first go to /, then to /share, and then to /share/PILab. Once you have found the right location, you can drag files from pane to pane. Some clients may also support dragging to or from your Desktop.
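For the command line route, a session can also be scripted by feeding the commands to sftp on standard input. This is only a sketch: the function name upload_to_lab and the example file name are invented here, and it assumes that your EID is your login name on the server and that your lab folder is /share/PILab as described above.

```shell
# Hypothetical helper: upload one file into a lab folder over SFTP.
# Usage: upload_to_lab EID PILab local_file
upload_to_lab() {
    # sftp reads its commands from standard input when that is not a
    # terminal; any password prompt still appears on your terminal.
    sftp "$1@files.ccbb.utexas.edu" <<EOF
cd "/share/$2"
put "$3"
ls -l
bye
EOF
}
```

For example, upload_to_lab EID HillisLab results.txt (with your own EID) would change to /share/HillisLab, upload results.txt, list the folder, and log out.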

Ascii vs Binary

There is another subtlety here. There are two types of files: text, and binary. Binary files can contain any type of information, while text files are limited to printable characters plus special formatting codes to indicate where lines of text end. Word document files are binary, for example, but files edited in Notepad are text. In the early days of computers, two codes were needed to end a line, and they were called carriage return and line feed (also called new line). This was needed because teletypes were too slow to print at the rates that computers could send information to them, and the second code gave the teletype time to catch up. Eventually hardware speeds caught up, and the obvious split in operating system support for text files occurred. UNIX (including our Linux clusters and the UNIX side of Mac OSX) uses only the newline character in text files. Pre-OSX Macs use only carriage return, and some older Mac applications still save files that way. Finally, Windows uses both carriage return and line feed.

As a result of this bifurcation, some early file transfer programs provided two transfer modes, ascii and binary. When a file is transferred in binary mode it is transferred exactly as is. When a file is transferred in ascii (text) mode, any end of line characters are translated so that the file is usable by text editors on the remote system. If the file wasn't actually text, then of course bytes that happen to look like end of line characters are found and translated, leading to corruption. Therefore, if asked, use binary mode. Text files can always be fixed by running unix2dos/dos2unix, or mac2unix/unix2mac, on the UNIX system. Alternatively, you can use Crimson Editor or Text Wrangler, as they have settings to save files in UNIX mode. Then you can just always use them to edit the text files you upload and download in binary mode.
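As a rough illustration of the fix-up step, the sketch below checks a file for carriage returns and strips or translates them with tr, which may help on a system where dos2unix is not installed. The function names has_cr, crlf_to_lf, and cr_to_lf are made up for this example and are not CCBB tools.

```shell
# Hypothetical helpers; the names are illustrative only.

# Succeed if the file contains any carriage return characters.
has_cr() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Windows to UNIX: drop every carriage return, keeping the newlines.
# Usage: crlf_to_lf INPUT OUTPUT
crlf_to_lf() {
    tr -d '\r' < "$1" > "$2"
}

# Carriage-return-only (old Mac style) to UNIX: turn each carriage
# return into a newline.
# Usage: cr_to_lf INPUT OUTPUT
cr_to_lf() {
    tr '\r' '\n' < "$1" > "$2"
}
```

Each converter writes a new file and leaves the original alone. Only run these on text files; a binary file would be corrupted in the same way as by an ascii mode transfer.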

Unix Login Access

If you are logged in to one of our UNIX systems, or to one of our clusters, you can access your files in one of two ways. First, when you log in you will be in your home folder, or directory. These files can be accessed in the bash shell using ~ or using /home/EID. We assume that you understand how to work on the UNIX command line; if not, then you probably need to visit the New Users Guide, which has references to some books that you might want to consult. Second, if you need to access files in your lab's shared directory, they are in /share/PILab (eg, /share/HillisLab).

The fact that both your home directory and lab folder are on the file server is not important. UNIX commands work on them as expected, and you can use editors or commands to process the files as you normally would. However, the UNIX ability to multitask presents a big problem which can quickly cause trouble for the other users of files.ccbb.utexas.edu. For most uses of the server, very little data is sent to and from the server. For example, when you are editing files there will be big pauses while you are thinking. Even if you are typing rapidly, there will still be small delays between the times when data is sent over to the server. This presents very little load on the server. Likewise, if someone is only periodically copying files to or from the server there is very little load. As long as this is the case, we will all be happy because the server will respond in time to make each of us think we are the only one using it. On the other hand, if you launch several programs to process your files on a UNIX system or cluster, and they do large amounts of reading or writing of files, then the server will very quickly become overloaded. Besides the fact that this slows down your overall processing, it makes it very hard for the interactive users to work, since they will experience delays between telling the server to do something and it actually happening.

As a result, when doing heavy processing of files you should first copy the files into a local working directory. Then, after processing them, you can copy any results which need to be saved back into your home or lab folder. This can be done in several ways. First, if you are using a cluster, then each cluster node has a directory called /state/partition1. If you write a job script, or run an interactive job (see the tutorial on our cluster help page), then it's essential that you make a temporary folder in /state/partition1, copy over the files that you need to process, and process them locally. Once done, you should copy the files that you want to save back over to your home directory, or to your lab's directory, and then remove your temporary directory.
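The copy in, process, copy out, and clean up steps might look something like the sketch below in a job script. The function name process_locally and the SCRATCH_ROOT override are invented for this example; only the /state/partition1 default comes from the notes above, and the way your real program takes its input and output will likely differ.

```shell
# Hypothetical helper: run a command on a local copy of one input file.
# Usage: process_locally INPUT OUTDIR COMMAND [ARGS...]
# COMMAND is run as "COMMAND ARGS... <input name>" inside the temporary
# directory, and its standard output is saved as <input name>.out in OUTDIR.
process_locally() {
    input=$1
    outdir=$2
    shift 2
    # SCRATCH_ROOT is an invented override; the default is the node-local disk.
    scratch_root=${SCRATCH_ROOT:-/state/partition1}
    workdir=$(mktemp -d "$scratch_root/${USER:-job}.XXXXXX") || return 1
    name=$(basename "$input")
    # Copy in, process locally, then copy only the result back to the server.
    if cp "$input" "$workdir/" &&
       ( cd "$workdir" && "$@" "$name" > "$name.out" ) &&
       cp "$workdir/$name.out" "$outdir/"
    then
        status=0
    else
        status=1
    fi
    rm -rf "$workdir"    # always remove the temporary directory
    return "$status"
}
```

For instance, process_locally /share/HillisLab/data.txt /share/HillisLab sort would sort a local copy of data.txt and save the result as /share/HillisLab/data.txt.out, touching the file server only for the two copies.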

Rsync

This is an advanced topic which has its own page. Before reading that page, though, please note that you do not have to use rsync to share files between CCBB UNIX systems, as those files should be available on all of the systems that we manage. At most you should rsync data from your laptop to a CCBB system, or vice versa. Also, for the most part you should rsync data to or from files.ccbb.utexas.edu itself, as this would yield the best performance.

Monitoring Disk Usage

As you use the server you should track your usage. Principally you'll want to know how much space you are using, and how much free space is available for you to work, upload files, etc. This can be done in a variety of ways.

First, if you are logged in to a UNIX system you can use the df command to view how much space is left. For example, if I run

df -kh .
Filesystem            Size  Used Avail Use% Mounted on
darwin4:/export/home/EID
                      787G  551G  236G  70% /home/EID

I am told that there is 236G available. This is what is left for everyone to use, so don't assume that you can use it all. Mainly this is useful because if you need to upload 5G worth of files, and there is only 4G left, then you'll cause problems for other people. You can run this anywhere you happen to be since '.' means "where I currently am". You could also check another directory such as your lab's folder by replacing "." with /share/PILab.
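The "will my upload fit" check can be sketched with df and du together. The function names avail_kb and fits_on_server are invented for this example. The -P option asks df for its portable one-line-per-filesystem format, so a long filesystem name like the one above cannot wrap onto a second line and shift the columns.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the kilobytes available on the filesystem holding a directory.
# Usage: avail_kb DIR
avail_kb() {
    # Line 2 of "df -Pk" is the filesystem; column 4 is the available space.
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if a local file or folder is smaller than the free space at DEST.
# Usage: fits_on_server LOCAL_PATH DEST
fits_on_server() {
    need_kb=$(du -sk "$1" | awk '{ print $1 }')
    free_kb=$(avail_kb "$2")
    [ "$need_kb" -lt "$free_kb" ]
}
```

Keep in mind that a passing check only says the data would fit. The free space is shared, so an upload that would use most of what is left is still a problem for other people.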

We will monitor disk usage, and send out reports about usage. If you are asked about cleaning up, or if you want to just do so then you can also see disk usages on a per directory basis using the du command. For example,

du -kh Foo

will show me the disk usage of item Foo. If this is a directory, then this will be usage of all of the files, and sub-directories in Foo. You can see those using the command

du -ka Foo | sort -rn | more

This produces a listing of files sorted from largest to smallest. Since the -h option is not used, you will be given numbers which are file sizes in kilobytes, but at least you will be able to see the bigger items first. You can then remove them as needed.
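If the full listing is too long to page through, a small wrapper along these lines could limit it to the biggest entries. The function name largest_items is invented for this example.

```shell
# Hypothetical helper: show the N largest items under a directory,
# with sizes in kilobytes (N defaults to 10).
# Usage: largest_items DIR [N]
largest_items() {
    du -ka "$1" | sort -rn | head -n "${2:-10}"
}
```

The directory itself normally comes first, since its total includes everything inside it.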

If you are using a Mac or PC with Windows file sharing, then you can right click on the item and call up the Properties, or Get Info to have the sizes displayed to you.

Maintenance

The file server will be patched and rebooted every Wednesday (Note: not Tuesday, as it was before) from 7:00 am to 9:00 am. If you are using a cluster or UNIX system during this time you will not need to do anything specific, since UNIX systems can tolerate the server rebooting. Windows and Mac users will need to disconnect from the server. If you are a Mac user, this is done by dragging the disk icon to the Trash, or by ejecting the disk in the Finder. If you are a Windows user, you should right click the disk icon in Computer or My Computer and choose "Disconnect". If you do not wish to disconnect from the server, then please at least close Word and other applications that are actively using files on the server. Finally, SFTP users will also need to log out.