Samtools view does not output all reads
2
0
Entering edit mode
3.0 years ago
b10hazard • 0

I have an interesting problem. I'm aligning some paired end reads, each read file (read1/read2) has 3235888 reads per file or 3235888 read pairs as input. My alignment command is...

bwa sampe hg19.reference.fa aln.r1.sai aln.r2.sai read1.fq read2.fq > aligned.bam

After alignment I checked my bam file by running samtools flagstat and I got this....

6471776 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
6416289 + 0 mapped (99.14% : N/A)
6471776 + 0 paired in sequencing
3235888 + 0 read1
3235888 + 0 read2
6380400 + 0 properly paired (98.59% : N/A)
6398872 + 0 with itself and mate mapped
17417 + 0 singletons (0.27% : N/A)
12428 + 0 with mate mapped to a different chr
8191 + 0 with mate mapped to a different chr (mapQ>=5)

The numbers match up perfectly with my input. Everything looks okay so far. However, a problem occurs when I try to find certain read pairs in the bam file. For example, this read pair exists in the input fastq files...

@K00123:1845:HCCV6ACX1:2:1113:5997:6622
AGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+
DDDDDIIIIIIIIIIIIHIGIIIIIIIIIIIIGHIIIIIIIIIIIGHII

@K00123:1845:HCCV6ACX1:2:1113:5997:6622
CCGAACACAATAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG
+
00000//111111<1<<111111<<F<11DGE@G?11<<<0CEDH<CCE

But when I search for this read pair in the bam using samtools view, I cannot find it! My search command is...

samtools view aligned.bam | grep AGTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
samtools view aligned.bam | grep CCGAACACAATAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG

These commands return nothing. At first I though that samtools view might not print all lines so I checked using this command...

samtools view aligned.bam | wc -l 

This command returns 6471776, which indicates that all reads are being printed by the samtools view command and yet I still cannot see that read pair. Any ideas what might be going on here?

samtools bam • 1.6k views
ADD COMMENT
3
Entering edit mode
3.0 years ago

Likely the reads are rev comped in the bam. What happens when you grep the read name?

ADD COMMENT
0
Entering edit mode

as these are read pairs, I would expect one of the two to map directly on the reference

ADD REPLY
0
Entering edit mode

Thanks! The read was reverse complemented and its mate was also reverse complemented. I was able to verify by following your suggestion of grepping the read name. Ugh, such a newbie mistake :/

ADD REPLY
0
Entering edit mode
3.0 years ago
Martombo ★ 3.1k

SAM format reports the 5'->3' reference sequence of the read mapping, so if there are mismatches or the read maps on the other strand you won't be able to find its sequence in the file

ADD COMMENT
0
Entering edit mode

Indeed it seems that there are some mismatches in the alignment of read2:

Range 1: 68793 to 68837
Alignment statistics for match #1
Score   Expect  Identities  Gaps    Strand
78.7 bits(42)   2e-11   44/45(98%)  0/45(0%)    Plus/Plus
Query  5      ACACAATAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG  49
              ||||| |||||||||||||||||||||||||||||||||||||||
Sbjct  68793  ACACAGTAACTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAG  68837
ADD REPLY
0
Entering edit mode

Ah. Hadn't thought of this. I'll keep this in mind for future work. Thanks for bringing it to my attention.

ADD REPLY

Login before adding your answer.

Traffic: 2652 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6