How to remove singletons in a BAM file using samtools
1
2
Entering edit mode
3.8 years ago
neranjan ▴ 50

Dear All,

I want to remove the singletons from the aligned bam file.

fastq-dump --split-files SRR1517848


Then aligned the paired end fastq files, using BWA by:

bwa mem hg19.fa R_1.fa R_2.fa -o SRR1517848.sam


Then converted it to BAM format and I used samtools to remove the singletons using : according to the paper, article

samtools view -@ 8 -F 0x04 -b SRR1517848.sam > SRR1517848.bam


Then I looked into the stats using samtools flagstat command by;

samtools flagstat SRR1517848.bam


which gave;

   Before filtering
4549753 + 0 mapped (98.61% : N/A)
4609656 + 0 paired in sequencing
4461356 + 0 properly paired (96.78% : N/A)
4517788 + 0 with itself and mate mapped
27501 + 0 singletons (0.60% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)


And after filtering

4549753 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4464 + 0 supplementary
0 + 0 duplicates
4549753 + 0 mapped (100.00% : N/A)
4545289 + 0 paired in sequencing
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
27501 + 0 singletons (0.61% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)


So my Questions:

1) So the command to remove the singletons by -F 0x04 have only removed unmapped reads ?

2) In both before and after stats the singletons remain, and what are those?

Before: 27501 + 0 singletons (0.60% : N/A)

After : 27501 + 0 singletons (0.61% : N/A)

3) Is there a way to remove singletons in variant detection GATK pipeline ? what benefits does it have?

Thanks

BWA alignment samtools singletons • 4.2k views
5
Entering edit mode
3.8 years ago
rizoic ▴ 240

As per the SAM flag guide the flag that you are filtering for i.e. 0x04 is for unmapped reads. This explains the removal of unmapped reads.

In case you want to filter for singletons i.e. reads for which the mate is unmapped you an select that as the condition. This gives a SAM flag value of 0x08. You can hence change your command to

samtools view -@ 8 -F 0x08 -b SRR1517848.sam > SRR1517848.bam


This should remove the singleton reads.

3
Entering edit mode

I'd also add the header to the bam file (by adding the -h parameter), to avoid problems in downstream analysis.

0
Entering edit mode

Thanks when I used the above command it gave singletons removed reads:

4549662 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4373 + 0 supplementary
0 + 0 duplicates
4522161 + 0 mapped (99.40% : N/A)
4545289 + 0 paired in sequencing
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)


As for SAM flag guide when I use 0x08 I am removing mate unmapped (0x8) reads, and with a warning: Flag(s) and 0x8 cannot be set when read is not paired

eg: R1 and R2 are mate pairs, where:
R1 = mapped (mate is unmapped)
R2 = unmapped .
So with this I am removing the mate unmapped reads only ?
i.e so I am removing R2 reads and keeping the R1 ?

2) My Second question is How important is to remove the singletons in GATK variant detection pipe line? Is this a step that I need to be doing? or can I avoid this?

0
Entering edit mode

Thanks when I used the above command it gave singletons removed reads:

4549662 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4373 + 0 supplementary
0 + 0 duplicates
4522161 + 0 mapped (99.40% : N/A)
4545289 + 0 paired in sequencing
4461356 + 0 properly paired (98.15% : N/A)
4517788 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
39296 + 0 with mate mapped to a different chr
33576 + 0 with mate mapped to a different chr (mapQ>=5)


As for SAM flag guide when I use 0x08 I am removing mate unmapped (0x8) reads, and with a warning: Flag(s) and 0x8 cannot be set when read is not paired

eg: R1 and R2 are mate pairs, where:
R1 = mapped (mate is unmapped)
R2 = unmapped .
So with this I am removing the mate unmapped reads only ?
i.e so I am removing R2 reads and keeping the R1 ?

2) My Second question is How important is to remove the singletons in GATK variant detection pipe line? Is this a step that I need to be doing? or can I avoid this?

1
Entering edit mode

In your e.g. R1 is the singleton read since its mate is unmapped and thus R1 read will be removed when you filter for the flag 0x08.

0
Entering edit mode

Thanks for the clarification. So why is it important to remove the singleton ?
(So my thinking is::: since its mate can not be mapped to the chromosome , the mapped read can not be accurately determined. )