A healthy taste of resources available, specifically for this course - not a comprehensive catalog.
...
- SEQanswers forum - many NGS questions answered here
- A funny SEQanswers post about biologists starting to analyze NGS data: http://seqanswers.com/forums/showthread.php?t=4589
- UCSC Genome Browser - visualize and download NGS data (see more below)
- Galaxy website for online sequencing data analysis
- Broad Institute Integrated Genomics Viewer (IGV)
- especially good for visualizing BAM file details
- Michigan State University ANGUS resources
- A list of their tutorials: http://ged.msu.edu/angus/
- 2012 Next-Gen Sequence Analysis Workshop - a tutorial similar to our course
- Introduction to Sequence analysis in the Amazon EC2 cloud
- where you can "rent" Linux machines (useful if you don't have access to TACC)
...
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer "single molecule" sequencing
- "Single cell" sequencing
- Older technologies (less common now)
- Life Technologies SOLiD (short reads in "colorspace")
- Roche/454 – long (multi-Kb) reads often used in assemblies
Fastq analysis/manipulation/QC
- Wikipedia FASTQ format page
- Illumina library construction on GSAF user wiki - useful for contaminant detection or adapter removal
- FastQC from Babraham Bioinformatics – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- produces nice quality report for fastq files
- MultiQC – http://multiqc.info/
- A great tool for consolidating multiple QC reports into one HTML page
- Anna's tutorial on using MultiQC – https://wikis.utexas.edu/display/bioiteam/Using+MultiQC
- cutadapt – https://cutadapt.readthedocs.io/en/stable/
- An excellent command line tool for adapter sequence removal
- Good support for trimming paired-end datasets
- Available at TACC at
/work/projects/BioITeam/ls5/opt/cutadapt-1.10/cutadapt
- also needs this $PYTHONPATH modification:
export PYTHONPATH="/work/projects/BioITeam/ls5/lib/python2.7/site-packages:$PYTHONPATH"
- Script that handles the details of paired-end read trimming
/work/projects/BioITeam/common/script/trim_adapters.sh
- trimmomatic – http://www.usadellab.org/cms/?page=trimmomatic
- Supports trimming paired-end datasets. I haven't used it but it seems to be popular.
- fastx toolkit – http://hannonlab.cshl.edu/fastx_toolkit/
- Command line tools for fastq analysis and manipulation
- Good for hard clipping. Available at TACC.
- Documentation at: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
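As a rough illustration of what the FASTQ format holds and what a quality report summarizes, the sketch below parses 4-line FASTQ records and computes each read's mean Phred quality. It assumes Phred+33 encoding and unwrapped 4-line records. The function names and the two sample reads are made up for illustration, and this is not how FastQC itself is implemented.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (offset 33 is an assumption)."""
    if not qual:
        return 0.0
    scores = [ord(char) - offset for char in qual]
    return sum(scores) / len(scores)


# Two invented reads: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nACGTN\n+\nII5##\n"

if __name__ == "__main__":
    for name, seq, qual in read_fastq(io.StringIO(EXAMPLE)):
        print(name, len(seq), round(mean_quality(qual), 1))
```

For real data, the tools listed above remain the better choice; the sketch is only meant to make the record layout and the quality encoding concrete.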
...
- Comparison of different aligners
- by Heng Li, developer of BWA, samtools, and many other tools
- File formats
- Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
- fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
- fast, sensitive and extremely configurable
- The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
- email or come talk to me if you have questions or problems
...
- SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
- type in a decimal number to see which flags are set
- samtools – by Heng Li
- sam/bam conversion, flag filtering, sorting, indexing, duplicate filtering
- 0.1.xx versions: http://samtools.sourceforge.net/
- 1.x+ versions: http://www.htslib.org/
- Picard toolkit – http://broadinstitute.github.io/picard/
- sam/bam utilities that are read-group aware
- especially MarkDuplicates and MarkDuplicatesWithMateCigar for flagging duplicate alignments
- SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for sam/bam files.
- bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common bed, bam, vcf, gff file manipulation such as:
- intersecting bam or bed with annotation files
- bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html)
- merging overlapping regions
- bedtools merge (http://bedtools.readthedocs.io/en/latest/content/tools/merge.html)
- generation of per-base genome-wide signal in bedGraph format
- bedtools coverage (http://bedtools.readthedocs.io/en/latest/content/tools/coverage.html)
- bedtools multicov (http://bedtools.readthedocs.io/en/latest/content/tools/multicov.html)
- extracting fasta corresponding to regions
- bedtools getfasta (http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html)
- format conversion
- bedtools bamtobed (http://bedtools.readthedocs.io/en/latest/content/tools/bamtobed.html)
- bedtools bamtofastq (http://bedtools.readthedocs.io/en/latest/content/tools/bamtofastq.html)
- bedtools bedtobam (http://bedtools.readthedocs.io/en/latest/content/tools/bedtobam.html)
- Available in the TACC module system
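The SAM flag calculator listed above turns a decimal flag value into the set of named bits it contains. A minimal sketch of that decoding is shown here, using the bit values defined in the SAM specification. The function name and the shortened descriptions are illustrative choices, not part of any tool.

```python
# Bit values from the SAM specification; the descriptions are abbreviated.
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the descriptions of all bits set in a decimal SAM flag."""
    if not 0 <= flag < 0x1000:
        raise ValueError(f"flag out of range: {flag}")
    return [name for bit, name in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    for value in (99, 147, 4):
        print(value, explain_flag(value))
```

The same bit values are what the `-f` (require) and `-F` (exclude) options of `samtools view` take, so decoding a flag by hand is a quick way to check a filter before running it.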
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- crucial for performing format conversions, of which ChIP-seq analysis can have many
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
- UCSC file format conversion scripts - useful for getting to/from wig and bed to corresponding binary formats.
- Make sure you download the correct script for your operating system!
- Directories containing these tools can be found on ls5 at
/work/projects/BioITeam/common/opt/UCSC_utils.2013_03
/work/projects/BioITeam/common/opt/UCSC_utils.2017_07
- Mason program for simulating NGS sequencing reads
...
- Tools
- Broad Institute GATK - complex but powerful; used by TCGA, 1000 Genomes
- documentation page: https://software.broadinstitute.org/gatk/documentation/
- File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
- The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
- Dan Deatherage's Genome Variant Analysis CCBB summer school course
Genome Annotation
- DAVID – https://david.ncifcrf.gov/
- functional annotation from user-supplied gene lists
- GREAT – http://bejerano.stanford.edu/great/public/html/
- analysis tool that takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- human, mouse, zebrafish
- MEME-suite – http://meme-suite.org/
- a motif identification and discovery tool. Works with most species.
- takes fasta files as input, so:
- filter your bam/bed files to get the regions of interest
- then convert those regions to fasta, e.g. with bedtools getfasta (note that bedtools bamtofastq produces fastq, not fasta)
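Because MEME takes fasta input, regions of interest have to be turned into sequence records first. The sketch below shows that conversion step in plain Python, assuming the reference sequence fits in memory as a dictionary keyed by chromosome name. The function names, the `chrom:start-end` header style and the toy genome are illustrative assumptions, and strand is ignored. On real genomes a dedicated tool is the practical choice.

```python
def parse_bed(lines):
    """Yield (chrom, start, end) from BED lines, skipping headers and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def bed_to_fasta(bed_lines, genome):
    """Return fasta text for each BED region (strand is ignored in this sketch)."""
    records = []
    for chrom, start, end in parse_bed(bed_lines):
        # BED coordinates are 0-based and half-open, matching Python slicing.
        seq = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{seq}\n")
    return "".join(records)


if __name__ == "__main__":
    toy_genome = {"chr1": "ACGTACGTAC"}  # invented reference for illustration
    toy_bed = ["track name=peaks", "chr1\t2\t6\tpeak1", "chr1\t0\t3\tpeak2"]
    print(bed_to_fasta(toy_bed, toy_genome), end="")
```

The key detail is the coordinate convention: a BED interval `2 6` covers four bases, starting at the third base of the chromosome.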