Question: Difference Between Mapped And Unmapped Reads
8.9 years ago by
Ric170 wrote:


I aligned paired-end files with BWA. BWA's SAM file contains all reads, whether they hit the reference or not. With the following command I kept only the reads that hit the reference.

$ bwa sampe <reference.fasta> F3.sai F5.sai F3.fastq F5.fastq | \
awk '{if (substr($1,1,1)=="@") {print $0} else {if ($3!="*") {print $0}}}' > aln_hit_only.sam

Now I am confused about what mapped and unmapped reads mean, because running

$ samtools idxstats aln_hit_only.bam

gives me counts for both mapped and unmapped reads. I would expect unmapped reads to be reads that do not hit the reference.

What is the difference between mapped and unmapped reads?

Thank you in advance.

8.9 years ago by
David Langenberger9.6k wrote:

Just a short comment:

Whether a segment was mapped can be checked much more easily, since this information is stored in the bitwise FLAG (column 2 of a SAM entry).

To get the number of all mapped entries:

samtools view -S -F0x4 aln_complete.sam | wc -l

To get the number of unmapped reads:

samtools view -S -f0x4 aln_complete.sam | wc -l

More efficient: samtools view -c -F0x4 aln_complete.sam

Reply written 8.1 years ago by Konrad Rudolph140

Does it work with SAM files? I think only with BAM. But you're right, it's more efficient. One more comment: for the mapped reads, one should collapse reads mapped to multiple loci into unique read names, otherwise the sum of mapped and unmapped might be greater than the total number of reads: "samtools view -S -F0x4 aln_complete.sam | cut -f1 | sort | uniq | wc -l"

Reply written 8.1 years ago by David Langenberger9.6k
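
To illustrate the point about multiple loci, here is a rough sketch that needs only awk. The four records are made up and truncated to the first four SAM columns, and the function name sam_records is just for this example. It compares the number of records without the 0x4 bit against the number of distinct read names among them.

```shell
# Made-up, truncated SAM records (tab-separated): QNAME, FLAG, RNAME, POS.
sam_records() {
  printf 'r1\t0\tchr1\t100\n'     # mapped
  printf 'r1\t256\tchr2\t500\n'   # same read, secondary alignment (0x100)
  printf 'r2\t4\t*\t0\n'          # unmapped (0x4)
  printf 'r3\t16\tchr1\t300\n'    # mapped, reverse strand (0x10)
}

# Records whose 0x4 bit is clear: the same thing "-F0x4 | wc -l" counts.
# int(FLAG / 4) % 2 extracts the 0x4 bit without needing gawk's and().
mapped_records=$(sam_records |
  awk -F'\t' 'int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }')

# Distinct read names among those records.
mapped_names=$(sam_records |
  awk -F'\t' 'int($2 / 4) % 2 == 0 && !seen[$1]++ { n++ } END { print n + 0 }')

echo "mapped records: $mapped_records, distinct names: $mapped_names"
```

One caveat for paired-end data: both mates normally share the same read name, so counting distinct names counts pairs rather than individual reads.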

More efficient? If you have already indexed the BAM file, idxstats is far more efficient, because view has to scan through every read.

In answer to the original question: idxstats outputs "chrom chrom_size mapped_reads unmapped_reads". Unmapped reads whose mate is mapped are assigned to the same chrom as the mate. Unmapped reads with no mate or an unmapped mate are assigned to chrom "*".

Reply written 7.0 years ago by travcollier160
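
As a small sketch of how those four columns can be totalled, the snippet below sums them with awk. The numbers are invented for illustration and idxstats_example is a stand-in for real "samtools idxstats" output. The third total counts unmapped reads that sit on a real chrom line, i.e. the ones placed with a mapped mate.

```shell
# Invented idxstats-style lines (tab-separated):
# chrom, chrom_size, mapped_reads, unmapped_reads.
idxstats_example() {
  printf 'chr1\t1000\t40\t2\n'
  printf 'chr2\t800\t25\t1\n'
  printf '*\t0\t0\t7\n'
}

totals=$(idxstats_example | awk -F'\t' '
  { mapped += $3; unmapped += $4 }   # sum over every line, including "*"
  $1 != "*" { placed += $4 }         # unmapped reads listed under a chrom
  END { printf "%d %d %d\n", mapped, unmapped, placed }')

echo "mapped unmapped unmapped_with_chrom: $totals"
```

On real data the same awk program could be fed from "samtools idxstats" on an indexed BAM file in place of the example function.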
8.9 years ago by
Swbarnes21.5k wrote:

Unmapped reads are given the mapping coordinates of their mapped mate. It's in the SAM specification, and that's what bwa does. Feature, not bug.

So your awk statement won't do what you want it to do. You have to rely on the binary flag.

samtools view will filter a .bam based on the binary flag, so use that. Reads with the 4 bit set are technically unmapped, regardless of any other info in the line, like a mapping coordinate, a CIGAR string, etc.

Also, specific to bwa, if your read hangs off of one reference sequence onto another, it will be given an appropriate mapping position, based on where the read starts, but the unmapped flag will also be set. Feature, not bug.
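
A minimal sketch of the difference between the two filters, using awk only so it runs without samtools. The records are made up and cut down to six columns (a real SAM line has at least eleven), and sam_example is a name chosen for this example. The RNAME test keeps the unmapped read that carries its mate's coordinates, while the flag test drops it.

```shell
# Made-up, truncated SAM: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR.
sam_example() {
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'p1\t73\tchr1\t100\t37\t50M\n'   # 73 = 1+8+64: mapped, mate unmapped
  printf 'p1\t133\tchr1\t100\t0\t*\n'     # 133 = 1+4+128: unmapped, mate's coordinates
  printf 'p2\t77\t*\t0\t0\t*\n'           # 77 = 1+4+8+64: unmapped, mate unmapped
}

# RNAME-based filter, as in the question: keep headers and records with RNAME != "*".
by_rname=$(sam_example | awk -F'\t' '/^@/ || $3 != "*"' | grep -vc '^@')

# Flag-based filter: keep headers and records whose 0x4 bit is clear.
by_flag=$(sam_example | awk -F'\t' '/^@/ || int($2 / 4) % 2 == 0' | grep -vc '^@')

echo "kept by RNAME test: $by_rname, kept by flag test: $by_flag"
```

With samtools itself, the flag-based version would be along the lines of "samtools view -h -F 0x4 aln_complete.sam" (plus -S on older versions that need it for SAM input), where -h keeps the header lines.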
