filtering paired end mapped reads form SAM/BAM file
1
0
Entering edit mode
9.6 years ago
Raghav ▴ 100

Dear All,

I performed paired end mapping of transcriptome reads (genrated on illumina platform) by bowtie2. after mapping I used Samtools for calculation of mapping statistics by using samtools flagstat. and output is

68714536 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
54522137 + 0 mapped (79.35%:-nan%)
68714536 + 0 paired in sequencing
34357268 + 0 read1
34357268 + 0 read2
44630360 + 0 properly paired (64.95%:-nan%)
46641928 + 0 with itself and mate mapped
7880209 + 0 singletons (11.47%:-nan%)
168844 + 0 with mate mapped to a different chr
145423 + 0 with mate mapped to a different chr (mapQ>=5)

How to interpret this mapping output.

Apart from this, I want to filter only those mapped reads which are paired properly. For that purpose I used samtools again

samtools view -f 1 -F 12 <bamfile> > pairedendmapps.bam

But I am not sure whether it working correctly or not.

Waiting for your valuable comments

bowtie2 SAM BAM • 9.1k views
ADD COMMENT
3
Entering edit mode
9.6 years ago

For getting only properly paired alignments, you're better off with samtools view -f 0x2 some_file. Your command will get discordant and concordant reads, just not singletons.

For interpreting the results of flagstat, it'd be helpful to know what's confusing. You have an ~80% alignment rate, which is OK for many organisms. 64% of all of the reads aligned as properly paired, which is good (i.e., ~80% of actual alignments are properly paired). You have a relatively high rate of pairs where only one read aligned (singletons). Of those, most didn't align with a mate aligning to a different chromosome...which is good unless you're expecting a lot of chromosomal rearrangements.

ADD COMMENT
0
Entering edit mode

Thank you Sir for your kind reply

confusion:

46641928 + 0 with itself and mate mapped

168844 + 0 with mate mapped to a different chr

145423 + 0 with mate mapped to a different chr (mapQ>=5)

how to interpret these lines .

ADD REPLY
1
Entering edit mode

The first number is the number of concordant and discordant mappings. The other two are looking at a subtype of discordant mappings. Normally we expect fragments that we sequence to contain sequence from a single chromosome. In some cases (e.g., in cancer), there may be chromosomal rearrangements, causing many reads to span multiple chromosomes. This can also happen due to artifacts introduced in library prep. and/or mapping. These last two numbers are giving you information on that. So you have 16884 reads where the mates mapped to different chromosomes. One possibility is that this is just due to one mate of the pair being poorly mapped, meaning that it would have a low MAPQ score. So the second one is giving you information about the number with a somewhat decent MAPQ score. In short, the 168844 number isn't due to poor mapping.

ADD REPLY
0
Entering edit mode

A great answer definitely. I have a problem about the procedure after filter. I used the command samtools view -hf 0x2 some_file for the sam file generated by bwa-mem to get only properly paired alignments for further analysis. After using the samtools sort to sort the bam file, am error occurred when I used samtools view to check the sorted bam file.

[W::sam_read1] parse error at line 1 [main_samview] truncated file.

And this bam file cannot be used for index either. I guess it is probably because the header of filtered sam file is still same as the header of the original sam file? Looking forward to your reply. Much appreciate.

ADD REPLY

Login before adding your answer.

Traffic: 2543 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6