Reporting More Variants With Samtools
2
1
Entering edit mode
11.5 years ago

I am using samtools mpileup to call variants from my bam file. However, there are situation where for example there is G in reference sequence and there is ten times T in the reads - as I can see in IGV browser

However, in mpileup there is G just six times instead of ten:

contig 555 G 6 TttTtt 811657

I can understand this, maybe quality of reads/bases was not sufficient to get into mpileup.

Finally, this variant is missing from vcf file and I want it there.

I have tried this command: samtools mpileup -uf reference.fasta aln.bam | bcftools view -p 0.75 -cgv - | /usr/share/samtools/vcfutils.pl varFilter >aln.vcf

and than modifications like adding -B or -E parameters with no success. What can I do to force samtools call my variant?

Thank you very much.

EDIT: OK, I might figure it out. Mapping quality of those reads is 0. But if there are hundreads of reads with mapping quality 0, I can trust the variant they suggests, can not I?

samtools variant calling • 3.4k views
ADD COMMENT
2
Entering edit mode
11.5 years ago

I think you shouldn't trust variants from the reads with Mapping quality zero. MQ = 0 means that those reads could be placed at many other locations with the same confidence so I don't think using such reads to call variants would be a wise thing to do. If you think that few of the reads spanning your variants have non zero mapping quality and could be used then I would use -p 0.99 or 1.0 and that would emit a variants that have few supporting reads (though the real defn of p is sth else) but this trick helps.

Thanks

ADD COMMENT
0
Entering edit mode

Thanks for your answer. I have been thinking about it; what If I have duplicated gene with two forms differing only in one narrow part of alignment, then everything else would also have a mapping quality MQ=0, right? So this is one biological explanation, where MQ=0 is actually OK?

ADD REPLY
0
Entering edit mode
11.4 years ago

Well you are right. I am not sure how sensitive is the aligner while assigning mapping quality. I am not sure whether a small non-similar region between two duplicate genes is enough to assign non zero mapping quality to reads aligned on them. I am not an expert but the thing you described above forms the basis of CNV identifying algorithms. They tend to look for regions that have a high coverage compared to the neighboring regions and the whole genome. This way they can be sure that those regions (genes) have multiple copies in the sequenced genome and less copies in reference genome and reads from different but similar regions accumulate at one place with low mapping quality when aligned against reference genome. These regions should have a lot of heterozygous SNPs (Each different but similar region representing its own allele when aligned against reference genome). Normally, people use a term called inaccessible genome where they define the regions of genome that are highly repetitious OR near to indels OR have mapping quality < 20. They don't call SNPs in these regions as they know most of them would be false positive. Some people do the post filtering at the VCF file level where they use RMS (Root mean square mapping quality) to filter these SNPs.

ADD COMMENT

Login before adding your answer.

Traffic: 1719 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6