Reads Per Contig From Sam File?
4
8
Entering edit mode
11.4 years ago
Newvin ▴ 360

After recently assembling a transcriptome, I took the reads used to generate the assembly and aligned them to the assembled contigs using BWA. The output of this process is a SAM file. I'd like to parse this SAM file to get a list of the # of reads map to each assembled contig. Before rolling my own script, I wanted to see if anyone is aware of a tool or easy method by which this information can be extracted.

Thanks!

sam parsing alignment read contigs • 14k views
18
Entering edit mode
11.4 years ago

Convert your SAM file to an (ordered) BAM file, index it, then run samtools idxstats yourreads.bam.

The output from the first few lines of one of my hg19-aligned reads looks like so:

samtools idxstats reads.bam | head -n 5 chr1 249250621 675890 0 chr2 243199373 385641 0 chr3 198022430 262890 0 chr4 191154276 253843 0 chr5 180915260 370789 0  The first three columns are: contig name, contig length, # of reads aligned to that contig. ADD COMMENT 0 Entering edit mode Hello, (hopefully you will see this comment I know it's been like 8 years now since your answer was posted) but I wanted to ask you how did you set up your bwa-mem reference file so that the contigs (chromosomes in your case) had different lengths? I'm running into a weird situation where I have a reference file with various CDS sequences all different lengths, yet when I run the samtools idxstats sp1.bam | head -n 5 command, the contigs are all the same length and always the length is for the longest CDS sequence in the file. Any ideas?? ADD REPLY 3 Entering edit mode 11.4 years ago The contig ID is presumably in column 3 in the SAM file, so something like cut -f3 file.sam | sort | uniq -c should suffice, unless I misunderstand your question. Now, this doesn't consider mapping qualities, SAM flags and so on, of course, but you can add a samtools call to the beginning and pipe the output into the "cut" command. EDIT: If you have a BAM file (and a BAM index file) you can use "samtools idxstats" instead. ADD COMMENT 0 Entering edit mode "cut -f3 file.sam | sort | uniq -c ". Just for clarification, Does not this command give the length of the contig? ADD REPLY 0 Entering edit mode I use this all the time for sam files (pretty much the same as Mikael suggested) grep -v "^@" file.sam | cut -f3 | sort | uniq -c | awk '{print1 " " $2}' • You can pipe a sort after the awk to print by # matches: | sort -nk 1 ADD REPLY 0 Entering edit mode Ah, I see what you mean. I was probably thinking of a SAM file without a header, which would produce only the desired output (number of reads per contig). However, if the SAM file has a header section, this would unfortunately also pick up the contig lengths from the header. It is easy to get around this as well on the command line, but probably simpler and safer to just use samtools idxstats or any of the other tools mentioned here. Sorry about that. ADD REPLY 3 Entering edit mode 11.3 years ago Ahill ★ 1.9k Agree with the above answers. But if you are working with BAM files generated using older versions of samtools (pre r595, circa June 2010), be aware that "samtools idxstats" and Picard's BamIdxStats will not correctly count reads. For example BamIndexStats reports zero reads: $ java -jar AQG/OpenSource/picard/BamIndexStats.jar INPUT=NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam 1 length= 249250621 Aligned= 0 Unaligned= 0  instead of the actual 261,812,901 reads:  samtools view -c NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam '1'
261812901

2
Entering edit mode
11.4 years ago
Gaffa ▴ 500

As an alternative to samtools idxstats (answer given by Steve Lianoglou), Picard's BamIndexStats provides the same functionality.

Example output:

chr1 length=    230218  Aligned= 52377  Unaligned= 1962
chr2 length=    813184  Aligned= 179139 Unaligned= 4749
chr3 length=    316620  Aligned= 76014  Unaligned= 2106


http://picard.sourceforge.net/command-line-overview.shtml#BamIndexStats