Samtools Snp Calling
2
0
Entering edit mode
12.3 years ago
Joanne Lim ▴ 20

Hi,

I am currently using samtools/bcftools to call SNPs. My question is, how does samtools/bcftools call out alternative alleleS as opposed to the reference?

example 1:

ctg1117585322335 54283 . A G,T 49 . DP=21;AF1=1;CI95=1,1;DP4=0,0,0,21;MQ=29;FQ=-87 GT:PL:GQ 1/1:82,60,0,82,47,79:99

mpileup ctg1117585322335 54283 A 21 ggggggggggggggggtgggg 333333344343343304444

example 2:

ctg1117585322225 4142 . C T,G 185 . DP=29;AF1=1;CI95=1,1;DP4=0,0,0,29;MQ=53;FQ=-108 GT:PL:GQ 1/1:218,81,0,209,47,203:99

mpileup ctg1117585322225 4142 C 29 ttttttttttttttgttttttttttttgt JIIJJIIJJIJIJI9JJFI:IIIHJIJ2C

From example 1, the reported depth is 21. As shown in mpileup, G has a depth of 20 as opposed to T which has 1. In this case, I would like to further filter the SNPs found where only the SNPs that has more than a depth of 10 would be called.

When calling for alternative alleleS, is there a cut-off in terms base quality/depth for the second alternative allele ( in this case, T) to be called?

Thanks

Joanne

snp calling samtools • 6.8k views
ADD COMMENT
1
Entering edit mode
12.3 years ago

The short answer to your question is that you cannot eliminate the T directly using samtools. However, you do not really need to. You will need to look at the GT tag per sample to determine the genotype called. In this case, I would not be surprised if the GT tag showed 1/1, which means G/G as a genotype. Thus, though the T shows up as an alternate allele, it may not be included in any genotype.

If this does not make sense, then please edit your question to include the entire VCF record for that locus.

ADD COMMENT
0
Entering edit mode

Hey Sean, I've edited my question. Aprreciate the inputs!

ADD REPLY
0
Entering edit mode

The depth associated with each alternate allele is not included in the VCF output from either GATK or samtools, I do not think. You'll need to write a small script to do the counting based on the mpileup results, for example.

ADD REPLY
1
Entering edit mode
12.3 years ago

Hi,

In my view the good way of doing what you ask for is to compute the probabilities of all the possible alleles given the pileup and the qualities (based on some reasonable model, typically written using the Bayes theorem), and then decide which alleles to accept based on their probability (or on some score related to their probability).

The alternative route -- write an ad hoc set of filtering rules based on counts and qualities -- more often than not results in a combinatorial morass, which is then hard to explain to other people.

How do you do what I suggest? I am not entirely sure what the best way is, but I offer three remarks:

-I am looking at variants stored in VCF4.0 format (in 1000genomes) and I noticed there are fields indicating the probabilities of the possible alleles which is, as per the above, just what you need. Possibly you can also produce such output in the course of your analysis?

-I had a cursory look at GATK so I don't remember precisely, maybe it also prints out some useful score on which you can apply thresholds?

-I wrote a software which implements a Bayesian model, exactly to solve you problem you describe. It is unpublished, but we use it to do variant analysis. I can post it somewhere.

ADD COMMENT
0
Entering edit mode

Hi, thanks for the inputs. Counting and filtering seems to be the possible option. May I know if there's anywhere that I could get to try out your software? Also, I am still pretty curious though, when samtools calls for SNPs, is there a cut-off in base quality/depth for the second alternative allele to be called or it just calls out every probable alternative alleles?

ADD REPLY
0
Entering edit mode

Hi, counting and filtering is an option, but one has to keep in account counts and qualities, consistently. I don't know what samtools does (it might also depend on the version you are using) but I'd rather have it printing out all the possible alleles as a default behaviour. If I run my model e.g. on the second pileup, I find that the probability of the "majority" choice (TT) is roughly 6000 times that of GT. You can use these numbers to setup a consistent threshold for all your variants. I can post a MACOSX executable, or do you prefer a linux statically compiled?

ADD REPLY
0
Entering edit mode

Speaking of quality ( I am abit confused) In samtools vcfutils.pl script,we are filtering the RMS mapping quality. But in your opinion, in terms of genotyping, which would be more ideal? to filter based on the QUAL of the SNP called/based on the RMS mapping quality? Can you please post the executables for linux? Thanks!

ADD REPLY
0
Entering edit mode

So, the RMS mapping quality is one of the ingredients which goes into determining the QUAL number, as far as I understand (I could be wrong : it depends on the details of how samtools works, checking their paper would be useful, I read it once but forgot the formulae). You can certainly filter on both things.

ADD REPLY

Login before adding your answer.

Traffic: 1842 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6