Question

Number of mapped reads from BAM file

33

Entering edit mode

9.5 years ago

Prakki Rama ★ 2.7k

Please help me find the number of mapped reads from a bam file. I was going through forums and tutorials. Looking at samtools flagstat resulted the following:

My total read count is 413,033. So, I understand 453,800 to be number of alignments. Now, is it that 391,696 reads mapped? According to this tutorial, Yes.

1. samtools flagstat file.sorted.bam

453800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
391696 + 0 mapped (86.31%:-nan%) 0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Found another explanation at this page. It says 391,696 is the number of locations that reads mapped. A number same as flagstat is giving us above and the number of mapped reads are 413,033 same as my original fasta sequence number. Check commands below.

2. Number of Mapped locations:

samtools view -F 0x04 -c file.sorted.bam
391696

3. Number of Mapped reads:

samtools view -F 0x40 file.sorted.bam | cut -f1 | sort | uniq | wc -l
413033

Could I please know, which one, should I follow to find the number of reads that mapped the reference from my original fasta file? Suppose the reads are to be single end. I would thankful to your explanations.

bam sam reads mapping • 102k views

ADD COMMENT • link updated 22 months ago by bhanu.chandra1 ▴ 50 • written 9.5 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Hi!

I have run bowtie2 for the alignment of my chip-seq files, and then run samtools view to convert the sam files in bam. Also I have run samtools sort, to sort the bam file and samtools index to index the bam file.

bowtie2 -q -x mm10_genome -U file_trimmed.fq -S file.sam --no-unal
samtools view -bS -o file.bam file.sam
samtools sort -o file_sorted.bam file.bam
samtools index -b file_sorted.bam file.bai

Then I have run samtools flagstats and it resulted in an empty file, I mean, all the values are 0:

samtools flagstat file_sorted.bam 

41327440 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
41327440 + 0 mapped (100.00% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Could somebody help me with this issue?

The bowtie2 summary is correct and full of numbers of aligned and unaligned reads, so did I forget something when runing samtools? Some option or something? Because I used the default mode.

Thanks,
Iraia

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 6.2 years ago by iraia.munoa ▴ 130

Ram · Accepted Answer · 2015-04-14

38

Entering edit mode

9.5 years ago

Devon Ryan 104k

391696 is the number of mapped alignments in the BAM file. That includes multimappers, so it won't necessarily be the number of mapped reads.
This is identical to above and does the exact same thing.
This is the number of entries excluding those marked as read #1 in a pair. This is not what you want.

You have a single-end data set. What you want is the number of mapped alignment, excluding those marked as secondary or supplementary (i.e., multimappers or chimeric entries). Consequently, samtools view -F 0x904 -c file.sorted.bam should work for you. This won't work for paired-end datasets (you'd have to samtools view -F 0x4 foo.sorted.bam | cut -f 1 | sort | uniq | wc -l them).

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you for the post. Could I know what is 453,800 from flagstat? Could you please explain a bit more what is "excluding those marked as read #1 in a pair", "excluding those marked as secondary or supplementary". Would be thankful to your replies.

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by Prakki Rama ★ 2.7k

10

Entering edit mode

453800 is the total number of records (I used "entries" earlier, they're synonyms). It's equivalent to samtools view foo.bam | wc -l.

When one uses a paired-end dataset, each sequence fragment produces two reads, one originating from each end of the original fragment. When these pairs of reads are aligned together, the reads are labeled as being the first (read #1) or last (read #2) in the pair, which is useful for many types of datasets. The -F 0x40 flag to samtools view means, "exclude any record in the BAM file that has bit 0x40 set". If you read the SAM specification, you'll see that bit 0x40 indicates that a record contains read #1 of a pair. Consequently, you'll get unmapped reads, secondary alignments, and supplementary alignments, which isn't what you want.

Secondary alignments are what are commonly referred to as "multimappers". If a read (or pair) can map to more than one place, then one of those locations is termed a "primary alignment" and the others "secondary alignments". Imagine a read that maps to 10 places. If you don't exclude its secondary alignments, then you'll count it as aligning 10 times, instead of just once.

Supplemental alignments are also termed "chimeric alignments" and "non-linear alignments". It's often the case that the sample we're sequencing has structural variations when compared to the reference sequence. Imagine a 100bp read. Let us suppose that the first 50bp align to chr1 and the last 50bp to chr6. This indicates that we have a chromosomal translocation. We can't represent this type of alignment with a single entry in a SAM/BAM file, so it's written as multiple records. Again, one of these records is termed the "primary record" and the rest "supplementary". If you included the supplementary alignments you would again count a single aligned read multiple times.

So 0x904 is equivalent to 2308 in base 10. This means that bits 4 (unmapped), 256 (secondary alignment), and 2048 (supplementary alignment) are set and the "-F" option to samtools view then indicates that any record with one of these bits set is to be ignored.

Hope that helps and I recommend familiarizing yourself with the SAM specification...it's very handy.

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you very much for detailed explanation. I ll need to read through SAM specification to understand flags more deeply.

ADD REPLY • link 9.5 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

@Devon: Why should we exclude those marked as secondary or supplementary? I have aligned long isoseq reads in this case. So, It is better to consider all three flags 1) mapped alignment 2) multi mappers (like paralogs) 3) chimeric alignments. Because my genome is fragmented, so there is possibility that a read is broken and aligned). True. There will duplicate read IDs as you explained previously from multimappers and secondary alignments. But finally, I can sort and unique the read IDs which gives me number of reads that are mapped from my original fasta file. Thanks in advance.

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by Prakki Rama ★ 2.7k

1

Entering edit mode

You asked for the number of reads aligned. Thus, if a read aligns multiple times, you only want to count it once. What I described above will do that and be faster than sorting the read names and running them through uniq.

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Hi Devon. Thank you very much for your directions. I was trying to familiarize with SAM/BAM files. I was looking at the BAM file with paired end read mapping. If both reads in a pair (read1 and read2) are mapped, then I see same header (column1 in BAM) for both the reads. In such case as you mentioned above to get the mapped reads for paired data, samtools view -F 0x4 foo.sorted.bam | cut -f 1 | sort | uniq | wc -l would not work right? Because the headers are sorted and uniqued. Do I miss something here?

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.5 years ago by Prakki Rama ★ 2.7k

3

Entering edit mode

That should work.

Edit: I should add that I'm assuming you only care if at least one mate in the pair mapped and therefore want the pair counted once. If you want each read in the pair counted individually then use the -c option to samtools view and do the filtering according to flags.

ADD REPLY • link 9.5 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you very much devon. It was very helpful.

ADD REPLY • link 9.5 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Hi Devon, Could you please elaborate on the command to be used along with -c option in samtools.

I have xenograft samples where I performed alignment on a combined genome (human and mouse) for paired-end reads (RNAseq) using STAR tool. Now, I want to look for mapped reads (potentially store them in a new file) only with their primary alignment. Unfortunately, haven't succeeded !! Secondly, in a PE read, one read can map to one chromosome and another read can map to a different chromosome. I want to eliminate all such read pairs where the two reads of a pair map to different chromosomes (either between different chromosomes of human or different chromosomes of mouse or one read mapping to a human chromosome and another to a mouse chromosome). I will greatly appreciate any help on how to solve these problems.

ADD REPLY • link 7.1 years ago by kaursm • 0

0

Entering edit mode

The solution doesn't scale well for the situation when all alignments for a read are output and many reads have multiple valid alignments. Some tools, such as STAR, output text files of read mapping statistics, but others, such as Bowtie, don't save any such files and unfortunately require calculating the number of reads with samtools.

ADD REPLY • link 7.2 years ago by dario.garvan ▴ 520

0

Entering edit mode

Dear Ryan,

I wanted to apply your logic for counting uniquely mapped reads within paired-end datasets (STAR). My goal is, for a given location, to count the number of reads on forward and reverse strand (also using advices given here).

First, if I disregard the fact that my dataset is paired (samtools 1.9) :

samtools view -c paired.bam chr12:113429300-113429319 # count all
samtools view -c -F 20 paired.bam chr12:113429300-113429319 # exclude unmapped & reverse strand <=> keep mapped forward
samtools view -c -f 16 paired.bam chr12:113429300-113429319 # keep reverse strand

Results :
6481
2474
4007
=> 2474 + 4007 = 6481, everything is fine.

Now if I try to do it the right way for my paired dataset

samtools view paired.bam chr12:113429300-113429319 | cut -f 1 | sort | uniq | wc -l;
samtools view -F 20 paired.bam chr12:113429300-113429319 | cut -f 1 | sort | uniq | wc -l;
samtools view -f 16 paired.bam chr12:113429300-113429319 | cut -f 1 | sort | uniq | wc -l;

Results :
4492
2307
3982
=> 2307 + 3982 != 4492.

I am stuck here and cannot figure it out. Any tips ?

ADD REPLY • link 6.1 years ago by erwan.scaon ▴ 950

0

Entering edit mode

Hi, did you find an answer to the mentioned discrepancy?

ADD REPLY • link 22 months ago by bhanu.chandra1 ▴ 50

0

Entering edit mode

Hello, Sorry it's been a long time and I cannot remember if this was solved or not.