Reads Per Contig From Sam File?
4
8
Entering edit mode
11.2 years ago
Newvin ▴ 360

After recently assembling a transcriptome, I took the reads used to generate the assembly and aligned them to the assembled contigs using BWA. The output of this process is a SAM file. I'd like to parse this SAM file to get a list of the # of reads map to each assembled contig. Before rolling my own script, I wanted to see if anyone is aware of a tool or easy method by which this information can be extracted.

Thanks!

sam parsing alignment read contigs • 14k views
ADD COMMENT
18
Entering edit mode
11.2 years ago

Convert your SAM file to an (ordered) BAM file, index it, then run samtools idxstats yourreads.bam.

The output from the first few lines of one of my hg19-aligned reads looks like so:

$ samtools idxstats reads.bam | head -n 5
chr1  249250621  675890  0
chr2  243199373  385641  0
chr3  198022430  262890  0
chr4  191154276  253843  0
chr5  180915260  370789  0

The first three columns are: contig name, contig length, # of reads aligned to that contig.

ADD COMMENT
0
Entering edit mode

Hello, (hopefully you will see this comment I know it's been like 8 years now since your answer was posted) but I wanted to ask you how did you set up your bwa-mem reference file so that the contigs (chromosomes in your case) had different lengths?

I'm running into a weird situation where I have a reference file with various CDS sequences all different lengths, yet when I run the samtools idxstats sp1.bam | head -n 5 command, the contigs are all the same length and always the length is for the longest CDS sequence in the file. Any ideas??

ADD REPLY
3
Entering edit mode
11.2 years ago

The contig ID is presumably in column 3 in the SAM file, so something like

cut -f3 file.sam | sort | uniq -c

should suffice, unless I misunderstand your question.

Now, this doesn't consider mapping qualities, SAM flags and so on, of course, but you can add a samtools call to the beginning and pipe the output into the "cut" command.

EDIT: If you have a BAM file (and a BAM index file) you can use "samtools idxstats" instead.

ADD COMMENT
0
Entering edit mode

"cut -f3 file.sam | sort | uniq -c ".

Just for clarification, Does not this command give the length of the contig?

ADD REPLY
0
Entering edit mode

I use this all the time for sam files (pretty much the same as Mikael suggested) grep -v "^@" file.sam | cut -f3 | sort | uniq -c | awk '{print $1 " " $2}'

  • You can pipe a sort after the awk to print by # matches: | sort -nk 1
ADD REPLY
0
Entering edit mode

Ah, I see what you mean. I was probably thinking of a SAM file without a header, which would produce only the desired output (number of reads per contig). However, if the SAM file has a header section, this would unfortunately also pick up the contig lengths from the header. It is easy to get around this as well on the command line, but probably simpler and safer to just use samtools idxstats or any of the other tools mentioned here. Sorry about that.

ADD REPLY
3
Entering edit mode
11.2 years ago
Ahill ★ 1.9k

Agree with the above answers. But if you are working with BAM files generated using older versions of samtools (pre r595, circa June 2010), be aware that "samtools idxstats" and Picard's BamIdxStats will not correctly count reads. For example BamIndexStats reports zero reads:

$ java -jar $AQG/OpenSource/picard/BamIndexStats.jar INPUT=NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam
1 length=       249250621       Aligned= 0      Unaligned= 0

instead of the actual 261,812,901 reads:

$ samtools view -c NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam '1'
261812901
ADD COMMENT
2
Entering edit mode
11.2 years ago
Gaffa ▴ 500

As an alternative to samtools idxstats (answer given by Steve Lianoglou), Picard's BamIndexStats provides the same functionality.

Example output:

chr1 length=    230218  Aligned= 52377  Unaligned= 1962
chr2 length=    813184  Aligned= 179139 Unaligned= 4749
chr3 length=    316620  Aligned= 76014  Unaligned= 2106

http://picard.sourceforge.net/command-line-overview.shtml#BamIndexStats

ADD COMMENT

Login before adding your answer.

Traffic: 2052 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6