A sampling of the resources available, chosen specifically for this course; it is not a comprehensive catalog.
...
- Overviews
Technology intros
- Illumina (Solexa) – most common "short" (< 300 bp) read sequencing
- Newer single molecule sequencing
- Single cell sequencing
- Older technologies (less common now)
- Life Technologies SOLiD (short reads in "colorspace")
- Roche/454 – long (multi-Kb) reads often used in assemblies
...
- Comparison of different aligners
- by Heng Li, developer of bwa, samtools, and many other bioinformatics tools
- File formats
- input: FASTQ format
- output: the SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Aligners
- bwa (Burrows-Wheeler Aligner) by Heng Li – http://bio-bwa.sourceforge.net/
- fast, sensitive and easy to use
- bowtie2 – http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
- fast, sensitive and extremely configurable
- The BioITeam has some TACC-aware alignment scripts you might find useful:
- bwa alignment
/work/projects/BioITeam/common/script/align_bwa_illumina.sh
- bowtie2 alignment
/work/projects/BioITeam/common/script/align_bowtie2_illumina.sh
- merging sorted BAM files (read-group aware)
/work/projects/BioITeam/common/script/merge_sorted_bams.sh
- email or come talk to Anna if you have questions or problems
...
- SAM (Sequence Alignment Map) format specification (SAM1.pdf)
- Translate SAM file flags web calculator: http://broadinstitute.github.io/picard/explain-flags.html
- type in a decimal number to see which flags are set
- samtools – by Heng Li
- SAM/BAM conversion, flag filtering, sorting, indexing, duplicate filtering
- older 0.1.xx versions: http://samtools.sourceforge.net/
- newer 1.3+ versions: http://www.htslib.org/
- Picard toolkit – http://broadinstitute.github.io/picard/
- SAM/BAM utilities that are read-group aware
- especially MarkDuplicates and MarkDuplicatesWithMateCigar for flagging duplicate alignments
- SAMStat - http://samstat.sourceforge.net/
- produces detailed graphical statistics for SAM/BAM files.
- bedtools – http://bedtools.readthedocs.org/en/latest/
- All sub-commands: http://bedtools.readthedocs.io/en/latest/content/bedtools-suite.html
- Swiss army knife for all manner of common BED, BAM, VCF, GFF/GTF file manipulation.
- See BEDTools Overview for some common use cases.
- Available in the TACC module system
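The flag translation done by the SAM flags web calculator listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the bit values follow the SAM format specification, but the description strings and the `explain_flag` helper are names made up for this example, not part of samtools or Picard.

```python
# Bit values as defined in the SAM format specification.
# The description strings are informal paraphrases (an assumption of this sketch).
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails platform/vendor quality checks"),
    (0x400, "read is PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return descriptions of every bit set in a decimal SAM flag value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    # 99 = 1 + 2 + 32 + 64, a typical flag for the first read of a proper pair
    for description in explain_flag(99):
        print(description)
```

The same bit tests are what flag-based filtering relies on when selecting or excluding alignments by flag value.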
File formats and conversion
- SAM format specification – http://samtools.github.io/hts-specs/SAMv1.pdf
- an essential reference when performing format conversions, of which ChIP-seq analysis can require many
- Genome browser file formats – http://genome.ucsc.edu/FAQ/FAQformat.html
- BED, bedGraph, narrowPeak and many more
- SRA (Sequence Read Archive) from NCBI
- overview on this wiki
- SRA search home page
- SRA Toolkit
- UCSC file format conversion scripts - useful for getting to/from WIG and BED to corresponding binary formats.
- Make sure you download the correct scripts for your operating system!
- Directories containing these tools can be found at TACC:
/work/projects/BioITeam/common/opt/UCSC_utils.20132019_0309
/work/projects/BioITeam/common/opt/UCSC_utils.2017_07
- Mason program for simulating NGS sequencing reads
UCSC Genome Browser
- Main UCSC Genome Browser web site
- File formats - BED format especially is widely used
- Table browser - Browse and download data in different formats
- ENCODE data downloads at UCSC - useful for getting data to work with
- Beta Test browser site - most up-to-date datasets and features; can be buggy
- Visualize mapped data at UCSC genome browser on this wiki
...
- Broad Institute GATK (Genome Analysis Tool Kit) – https://software.broadinstitute.org/gatk/documentation/
- complex but powerful
- used by TCGA, 1000 Genomes
- File formats
- VCF (Variant Call Format) v4.0 - initially developed by 1000 Genomes project
- MAF (Mutation Annotation Format) – developed by The Cancer Genome Atlas (TCGA)
- The International Genome Sample Resource – follow-on to the 1000 Genomes project
- catalog of human genetic variants
- Dan Deatherage's Genome Variant Analysis CCBB summer school course
Genome Annotation
- GO – http://geneontology.org/
- The Gene Ontology resource, a large source of information on the functions of genes
- GOrilla – http://cbl-gorilla.cs.technion.ac.il/
- Gene Ontology enRIchment anaLysis and visuaLizAtion tool
- DAVID – https://david.ncifcrf.gov/
- functional annotation from user-supplied gene lists
- GREAT – http://bejerano.stanford.edu/great/public/html/
- Genomic Regions Enrichment of Annotations Tool
- takes bed files as input and outputs enriched genes, GO-terms, motifs, etc.
- human, mouse, zebrafish
- MEME-suite – http://meme-suite.org/
- a motif identification and discovery tool. Works with most species.
- takes FASTA files as input
- filter your BAM/BED files to get the regions of interest
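- then convert to FASTA, for example by extracting the sequences of BED regions with bedtools getfasta (note that bedtools bamtofastq produces FASTQ, not FASTA).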
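To make the "regions of interest to FASTA" step above concrete, here is a minimal pure-Python sketch of pulling the sequence for each BED interval out of a reference genome. It is an illustration under stated assumptions, not a replacement for bedtools: the `read_fasta` and `bed_to_fasta` helpers are hypothetical names, the whole genome is assumed to fit in memory, strand is ignored, and the `chrom:start-end` header style is just one reasonable choice.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a {sequence_name: sequence} dict."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def bed_to_fasta(bed_lines, genome):
    """Yield (header, sequence) for each BED interval.

    BED starts are 0-based and ends are exclusive, so they map
    directly onto Python slice indices.
    """
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end]


if __name__ == "__main__":
    # Toy data standing in for a real reference and a real peak file.
    genome = read_fasta([">chr1 toy", "ACGTAC", "GTTT", ">chr2", "GGGCCC"])
    bed = ["chr1\t2\t6\tpeak1", "chr2\t0\t3\tpeak2"]
    for header, seq in bed_to_fasta(bed, genome):
        print(f">{header}\n{seq}")
```

For real genomes, an indexed lookup is far more practical than loading every chromosome into memory; the sketch is only meant to show how BED coordinates translate into sequence slices.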
...