Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows reads in any region to be retrieved swiftly. In one of our previous tutorials, we mapped reads with TopHat and obtained BAM files. However, before we can use these BAM files in downstream analysis, we need to learn the basic and more advanced operations that let us handle, filter and pre-process them. In this tutorial, we explain how to manipulate BAM files with samtools, an excellent suite of bioinformatics commands for working with SAM/BAM files.
SAM file format
SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of an optional header section and an alignment section. If present, the header must precede the alignments. Header lines start with '@', while alignment lines do not.
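Because header lines are the only ones that begin with '@', the two sections can be told apart with standard text tools. The following is a minimal sketch, assuming a POSIX shell with grep; the SAM content is a tiny made-up file (the reference "chr1" and the read "read1" are hypothetical), not data from this tutorial.

```shell
#!/bin/sh
# Build a tiny, made-up SAM file: two header lines and one alignment line.
sam_file=$(mktemp)
printf '@HD\tVN:1.0\tSO:coordinate\n' >  "$sam_file"
printf '@SQ\tSN:chr1\tLN:1000\n'      >> "$sam_file"
printf 'read1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "$sam_file"

# Header lines start with '@'; every other line is an alignment line.
header_lines=$(grep -c '^@' "$sam_file")
alignment_lines=$(grep -vc '^@' "$sam_file")

echo "header lines:    $header_lines"
echo "alignment lines: $alignment_lines"

# Remove the temporary file.
rm -f "$sam_file"
```

The same two grep patterns can be pointed at any SAM file to check quickly whether a header is present before passing the file to other tools.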
A full description of the SAM format can be found here. SAM aims to be a format that:
- Is flexible enough to store all the alignment information generated by various alignment programs;
- Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
- Is compact in file size;
- Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
- Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.

Example of the SAM file:
Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and a variable number of optional fields for flexible or aligner-specific information:
1. Read name
2. SAM flag
3. Chromosome (if the read has no alignment, there will be a "*" here)
4. Position (1-based index, "left end of read")
5. MAPQ (mapping quality; describes the uniqueness of the alignment: 0 = non-unique, >10 = probably unique)
6. CIGAR string (describes the position of insertions/deletions/matches in the alignment; it also encodes splice junctions, for example)
7. Reference name of the mate (the chromosome the mate maps to in paired-end sequencing; often "=", meaning the same chromosome as this read)
8. Position of mate (mate pair information)
9. Template length (0 for single-end reads or when the information is unavailable; always zero for me)
10. Read sequence
11. Read quality

Program-specific flags follow as optional fields (e.g. AS is an alignment score, NH is the number of reported alignments that contain the query in the current record).

Converting BAM to SAM and vice versa
The 'samtools view' command converts an alignment in the binary BAM format, which is not human-readable, into the human-readable SAM format. Download the data we obtained in the TopHat tutorial on RNA expression in human brain. You can submit samtools via the idna_submit.py wrapper. Remember to replace userXXX in the path /data/userXXX/ with your user ID, which you can find in the header of the console:
idna_submit.py -t bam2sam -c 1 -r 1.7 -e 'idna_samtools_view -h /data/userXXX/out/accepted_hits.bam > /data/userXXX/bam_manipulating/accepted_hits.sam'
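Outside the idna_submit.py wrapper, the same conversion can be sketched with plain samtools. The block below is an illustration under stated assumptions, not the wrapper's actual implementation: it assumes samtools 1.x is on the PATH and that idna_samtools_view passes its arguments to 'samtools view'. The helper names bam_to_sam, sam_to_bam and show_fields are hypothetical, and the demo record is made up. The column names printed by show_fields (QNAME, FLAG, RNAME, ...) are those used in the SAM specification for the 11 mandatory fields.

```shell
#!/bin/sh
# BAM -> SAM: -h keeps the header in the output, -o names the output file.
bam_to_sam() {   # usage: bam_to_sam in.bam out.sam
    samtools view -h -o "$2" "$1"
}

# SAM -> BAM: -b writes BAM. The SAM input should carry its header.
# (Old 0.1.x releases additionally need -S to read SAM input.)
sam_to_bam() {   # usage: sam_to_bam in.sam out.bam
    samtools view -b -o "$2" "$1"
}

# Print the 11 mandatory fields and any optional tags of the first alignment.
show_fields() {  # usage: show_fields file.sam
    awk -F'\t' '
        /^@/ { next }   # skip header lines
        {
            split("QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL", name, " ")
            for (i = 1; i <= 11; i++) printf "%-6s %s\n", name[i], $i
            for (i = 12; i <= NF; i++) printf "%-6s %s\n", "TAG", $i
            exit        # first alignment only
        }' "$1"
}

# Demo on a made-up SAM file; this part does not need samtools.
demo_sam=$(mktemp)
printf '@SQ\tSN:chr1\tLN:1000\n' > "$demo_sam"
printf 'read1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\n' >> "$demo_sam"
fields=$(show_fields "$demo_sam")
echo "$fields"
rm -f "$demo_sam"
```

With samtools installed, bam_to_sam accepted_hits.bam accepted_hits.sam followed by show_fields accepted_hits.sam gives a quick look at the first record of the converted file.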