Environment Variables
Linux sets lots of variables in your default environment. You read the value of a variable by putting a '$' in front of its name and, by convention, the names frequently use capital letters (but that is not required). You can also create your own variables to use later, which is a convenient trick that we will use. Try the following commands:
List all the variables currently in your environment:
env
Print the "HOME" variable to see what it contains:
echo $HOME
Notice that you have to use a $ to read the value of an existing variable!
What do you think the WORK variable contains?
Try this:
cd $WORK
pwd
cd $HOME
pwd
echo $PWD
HOME and WORK are "convenience variables" because they are short, easy to remember, and faster than typing out the full directory structure. The command "pwd" prints the working directory, but there is also a variable called $PWD that is set to your present working directory too.
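If you want to convince yourself that "pwd" and $PWD agree, you can compare them directly. This is just a small sketch: it uses a throwaway directory from mktemp instead of $WORK (which may not be set outside TACC systems), and the names start_dir, tmp_dir, from_command, and from_variable are made up for the illustration.

```shell
# Remember where we started, then move into a throwaway directory
start_dir=$PWD
tmp_dir=$(mktemp -d)
cd "$tmp_dir"

# Ask the command and the variable for the same information
from_command=$(pwd)
from_variable=$PWD
echo "pwd says:  $from_command"
echo "PWD holds: $from_variable"

# Go back and clean up
cd "$start_dir"
rmdir "$tmp_dir"
```

The two lines printed should show the same path, because the shell updates $PWD every time you "cd".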
I put some example data in a folder for you, but you probably don't want to type that path in every time. Let's create a new environment variable:
export TEST_DATA=/scratch/01114/jfonner/training
ls $TEST_DATA
Notice that '$' is a prefix only to existing variables. When creating a new variable, we don't need the '$'. When referencing a variable we already created, we need the '$' first.
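The "export" keyword is what makes a variable part of your environment, so that programs you start from the shell can see it too. Here is a minimal sketch of the difference. The variable names SHELL_ONLY_NOTE and SHARED_NOTE are hypothetical, and "sh -c" is used only as a stand-in for "some other program".

```shell
# Hypothetical variable names, chosen only for this illustration
SHELL_ONLY_NOTE="visible here"          # no export: stays in this shell
export SHARED_NOTE="visible everywhere" # export: passed on to child programs

# Ask a child shell what it can see; ${NAME:-text} prints text if NAME is unset
child_sees_plain=$(sh -c 'echo "${SHELL_ONLY_NOTE:-<not set>}"')
child_sees_exported=$(sh -c 'echo "${SHARED_NOTE:-<not set>}"')

echo "plain:    $child_sees_plain"
echo "exported: $child_sees_exported"
```

Both variables work with "echo" in your current shell, but only the exported one shows up in the child process (and in the output of "env").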
One Line Scripts
Life scientists often get stuck trying to do simple searching, sorting, and filtering of data if there isn't a tool that already does exactly what is needed. Let's explore some built-in tools in Linux:
head $TEST_DATA/250k_reads.fastq
grep -A 1 '^@M00' $TEST_DATA/250k_reads.fastq | head -20
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | head
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | head
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | grep -v '^--$' | head
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | grep -v '^--$' | sort | head
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | grep -v '^--$' | sort | uniq -c | head
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | grep -v '^--$' | sort | uniq -c | sort -n -r | head
If you don't know what a command did, try using the "man" command, for example "man uniq".
The above example isn't particularly efficient, but it gets the job done. In the above examples, we heavily abuse the Linux "pipe," which is the symbol '|' (probably shares the '\' key on your keyboard above the Enter key). A pipe takes the "standard output" of one command, which would normally go to the screen, and sends it to the next command as its input.
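If you don't have the training data handy, you can watch the same pipeline work on a tiny file. The sketch below builds a made-up three-record FASTQ file (the read names and sequences are invented, and the variable names reads and counts are just for this example) and then runs the same chain of commands on it.

```shell
# Build a tiny, made-up FASTQ file: three records of four lines each
reads=$(mktemp)
printf '@M00001:1\nACGT\n+\nIIII\n' >  "$reads"
printf '@M00001:2\nTTGA\n+\nIIII\n' >> "$reads"
printf '@M00001:3\nACGT\n+\nIIII\n' >> "$reads"

# Each header plus the line after it; grep puts "--" between the groups
grep -A 1 '^@M00' "$reads"

# Drop the headers and the "--" separators, then count each distinct sequence.
# uniq only merges neighboring duplicates, which is why sort comes first.
counts=$(grep -A 1 '^@M00' "$reads" | grep -v '^@M00' | grep -v '^--$' | sort | uniq -c | sort -n -r)
echo "$counts"

rm -f "$reads"
```

The last "echo" should list ACGT with a count of 2 ahead of TTGA with a count of 1. The first grep also shows where the '--' lines come from, which is why the full pipeline filters them out.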
If we really like this output, we can save it to a file and look through it rather than just printing the first 10 lines:
head -100000 $TEST_DATA/250k_reads.fastq | grep -A 1 '^@M00' | grep -v '^@M00' | grep -v '^--$' | sort | uniq -c | sort -n -r > my_file.txt
less my_file.txt
The "less" command lets you look through files. Up and down arrows move one line at a time. 'f' pages down, and 'w' pages up. "Shift+G" takes you to the end of the file, and lowercase 'g' takes you back to the beginning. 'q' exits the program.
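One thing to keep in mind about '>' is that it replaces whatever was already in the file. Here is a small sketch using a temporary file (the names out_file, line_count, and new_count are made up for the example):

```shell
# Save the output of a short pipeline to a temporary file
out_file=$(mktemp)
printf 'b\na\nb\n' | sort | uniq -c > "$out_file"
line_count=$(wc -l < "$out_file" | tr -d ' ')
echo "first write:  $line_count lines"

# A second > overwrites the file; use >> if you want to append instead
printf 'only line\n' > "$out_file"
new_count=$(wc -l < "$out_file" | tr -d ' ')
echo "second write: $new_count lines"

rm -f "$out_file"
```

So if you like what is in my_file.txt, pick a different file name before you rerun the command with different options.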
Further reading and examples: Scott's list of linux one-liners
Software Modules
TACC has lots of software packages, but most of them are not in your environment. TACC doesn't want to "bloat" your environment with a bunch of software that you don't use, so we put everything in modules. Try this:
module
module list
module avail
module key genomics
module show samtools
module load samtools
module list
module unload samtools
module list
If you have software that you use frequently, you can save all the modules that you currently have loaded as the default:
module save
Now, let's use some of our "one-liner" skills to see how many software modules there are at TACC related to life sciences. For learning purposes (and it's also a common way to operate), let's build up our command one step at a time:
module key biology chemistry genomics 2>&1
module key biology chemistry genomics 2>&1 | grep -v " "
module key biology chemistry genomics 2>&1 | grep -v " " | grep ":"
module key biology chemistry genomics 2>&1 | grep -v " " | grep ":" | grep "^ "
module key biology chemistry genomics 2>&1 | grep -v " " | grep ":" | grep "^ " | cut -d ':' -f 2 -s
module key biology chemistry genomics 2>&1 | grep -v " " | grep ":" | grep "^ " | cut -d ':' -f 2 -s | tr ',' '\n'
module key biology chemistry genomics 2>&1 | grep -v " " | grep ":" | grep "^ " | cut -d ':' -f 2 -s | tr ',' '\n' | wc -l
Can you come up with your own way of doing this that uses fewer steps?
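The "2>&1" in those commands is easy to overlook. It sends "standard error" to the same place as "standard output", and a pipe only carries standard output. The redirect is there presumably because the module command prints its listings on standard error. The sketch below uses a hypothetical function called noisy_tool as a stand-in for such a command, so it runs even where "module" is not installed.

```shell
# Hypothetical stand-in for a command that reports on standard error
noisy_tool() { echo "samtools: samtools/1.0" >&2; }

# Without 2>&1 the line skips the pipe and goes straight to the screen
seen_without=$(noisy_tool | wc -l | tr -d ' ')

# With 2>&1 the line travels through the pipe like normal output
seen_with=$(noisy_tool 2>&1 | wc -l | tr -d ' ')

echo "lines through the pipe without 2>&1: $seen_without"
echo "lines through the pipe with 2>&1:    $seen_with"
```

If one of your grep steps seems to have no effect at all, a missing "2>&1" is a good first thing to check.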
Finally, AWK gives us tons of ability to format text. AWK is really a little programming language. Its big utility is that when it reads a line, it assigns the line to the variable $0, and then each "field" is assigned to variables $1, $2, $3, etc. Look at this one-liner:
module keyword genomics 2>&1 | grep -v '^[A-Za-z0-9]' | grep -v '^---' | grep -v spider | grep -v '^$' | grep -v " " | sed s/','/' '/g | awk 'BEGIN {print "Module\t\tVersions\nList Updated\t"strftime("%B %d %Y",systime())", "} { prog=$1; prog_vers=$2 "\t" $3 "\t" $4; if (length(prog) < 8) prog=prog "\t"; print prog "\t" prog_vers ; }'
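The field variables are easier to see on a couple of lines of input that you control. This is a minimal sketch with made-up module names and version numbers (the variables sample and report are also just for the example). It uses $1, $2, and NF, which AWK sets to the number of fields on the current line.

```shell
# Made-up module names and versions, one module per line
sample='bwa 0.7.7 0.7.12
samtools 1.3'

# $1 is the first field, $2 the second, NF the number of fields on the line
report=$(printf '%s\n' "$sample" | awk '{ print $1 " has " (NF - 1) " version(s), first is " $2 }')
echo "$report"
```

By default AWK splits fields on spaces and tabs, which is why the one-liner above turns the commas into spaces with sed before handing the text to AWK.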