Question: Reads Per Contig From Sam File?
gravatar for Newvin
8.9 years ago by
Newvin340 wrote:

After recently assembling a transcriptome, I took the reads used to generate the assembly and aligned them to the assembled contigs using BWA. The output of this process is a SAM file. I'd like to parse this SAM file to get a list of the # of reads map to each assembled contig. Before rolling my own script, I wanted to see if anyone is aware of a tool or easy method by which this information can be extracted.


read alignment contigs sam parsing • 11k views
ADD COMMENTlink written 8.9 years ago by Newvin340
gravatar for Steve Lianoglou
8.9 years ago by
Steve Lianoglou5.1k
Steve Lianoglou5.1k wrote:

Convert your SAM file to an (ordered) BAM file, index it, then run samtools idxstats yourreads.bam.

The output from the first few lines of one of my hg19-aligned reads looks like so:

$ samtools idxstats reads.bam | head -n 5
chr1  249250621  675890  0
chr2  243199373  385641  0
chr3  198022430  262890  0
chr4  191154276  253843  0
chr5  180915260  370789  0

The first three columns are: contig name, contig length, # of reads aligned to that contig.

ADD COMMENTlink modified 7.6 years ago • written 8.9 years ago by Steve Lianoglou5.1k

Hello, (hopefully you will see this comment I know it's been like 8 years now since your answer was posted) but I wanted to ask you how did you set up your bwa-mem reference file so that the contigs (chromosomes in your case) had different lengths?

I'm running into a weird situation where I have a reference file with various CDS sequences all different lengths, yet when I run the samtools idxstats sp1.bam | head -n 5 command, the contigs are all the same length and always the length is for the longest CDS sequence in the file. Any ideas??

ADD REPLYlink written 12 months ago by DNAngel60
gravatar for Mikael Huss
8.9 years ago by
Mikael Huss4.7k
Mikael Huss4.7k wrote:

The contig ID is presumably in column 3 in the SAM file, so something like

cut -f3 file.sam | sort | uniq -c

should suffice, unless I misunderstand your question.

Now, this doesn't consider mapping qualities, SAM flags and so on, of course, but you can add a samtools call to the beginning and pipe the output into the "cut" command.

EDIT: If you have a BAM file (and a BAM index file) you can use "samtools idxstats" instead.

ADD COMMENTlink written 8.9 years ago by Mikael Huss4.7k

"cut -f3 file.sam | sort | uniq -c ".

Just for clarification, Does not this command give the length of the contig?

ADD REPLYlink written 8.6 years ago by Prakki Rama2.4k

I use this all the time for sam files (pretty much the same as Mikael suggested) grep -v "^@" file.sam | cut -f3 | sort | uniq -c | awk '{print $1 " " $2}'

  • You can pipe a sort after the awk to print by # matches: | sort -nk 1
ADD REPLYlink written 7.6 years ago by kmhernan840

Ah, I see what you mean. I was probably thinking of a SAM file without a header, which would produce only the desired output (number of reads per contig). However, if the SAM file has a header section, this would unfortunately also pick up the contig lengths from the header. It is easy to get around this as well on the command line, but probably simpler and safer to just use samtools idxstats or any of the other tools mentioned here. Sorry about that.

ADD REPLYlink written 8.6 years ago by Mikael Huss4.7k
gravatar for Ahill
8.8 years ago by
United States
Ahill1.9k wrote:

Agree with the above answers. But if you are working with BAM files generated using older versions of samtools (pre r595, circa June 2010), be aware that "samtools idxstats" and Picard's BamIdxStats will not correctly count reads. For example BamIndexStats reports zero reads:

$ java -jar $AQG/OpenSource/picard/BamIndexStats.jar INPUT=NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam
1 length=       249250621       Aligned= 0      Unaligned= 0

instead of the actual 261,812,901 reads:

$ samtools view -c NA12878.chrom1.ILLUMINA.bwa.CEU.high_coverage.20100311.bam '1'
ADD COMMENTlink written 8.8 years ago by Ahill1.9k
gravatar for Gaffa
8.9 years ago by
Gaffa490 wrote:

As an alternative to samtools idxstats (answer given by Steve Lianoglou), Picard's BamIndexStats provides the same functionality.

Example output:

chr1 length=    230218  Aligned= 52377  Unaligned= 1962
chr2 length=    813184  Aligned= 179139 Unaligned= 4749
chr3 length=    316620  Aligned= 76014  Unaligned= 2106

ADD COMMENTlink modified 12 months ago by RamRS30k • written 8.9 years ago by Gaffa490
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1249 users visited in the last hour