Finding True SNPs after hard filtering on GATK
7.6 years ago
jigarnt ▴ 30

Hi All,

I am working on a non-model organism for which no SNP data are available. After the hard-filtering step in GATK, I obtained a VCF file containing more than 50k SNPs. I am sure that this number is not correct. How should I proceed to find the "TRUE SNPs" among those 50k?

SNP • 4.1k views

Define "TRUE SNPs"? I assume you have a reference for the organism you are working on? Do you have multiple samples? If you have a large number of samples, you can calculate the minor allele frequency (MAF) of each SNP. Then, depending on your research question, you might select SNPs according to their MAF. Again, the most important thing is to know what you are aiming to achieve.
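
As a rough illustration of the MAF idea, here is a minimal Python sketch. It assumes a haploid organism, so each sample contributes one allele per site ("0" for reference, "1" for alternate, "." for missing). The function name and the input layout are made up for this example and are not part of GATK.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Return the minor allele frequency at one site.

    `genotypes` holds one allele string per haploid sample,
    e.g. "0" (ref), "1" (alt) or "." (missing call).
    """
    counts = Counter(g for g in genotypes if g != ".")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # no calls, or the site is monomorphic
    # Frequency of the second most common allele among called samples.
    return counts.most_common(2)[1][1] / total

site = ["0", "0", "0", "1", "0", "."]
maf = minor_allele_frequency(site)  # 1 alt call out of 5 called samples
```

With only a few strains the MAF is very coarse, so this is more useful as a sanity check than as a filter on its own.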


My organism is clonal, so what I am looking for is a handful of SNPs. In my case, a TRUE SNP would be a SNP called in one particular strain and absent from the other strains. I have filtered down to 4k probable variants through hard filtering in GATK. How should I proceed to get that handful of "TRUE SNPs"?

Looking forward to your suggestions and inputs.
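
The definition above (called in one strain, absent from the others) can be sketched in a few lines of Python. This assumes the per-strain calls have already been pulled out of the VCF into a dictionary; the `calls` layout and the function name `private_snps` are hypothetical, chosen only for illustration.

```python
def private_snps(calls):
    """Find sites where exactly one strain carries a non-reference allele.

    `calls` maps (contig, position) -> {strain: allele}, where the allele
    is "0" (ref), "1" (alt) or "." (missing).
    """
    result = []
    for site, by_strain in sorted(calls.items()):
        if "." in by_strain.values():
            continue  # a missing call means absence cannot be confirmed
        carriers = [s for s, allele in by_strain.items() if allele != "0"]
        if len(carriers) == 1:
            result.append((site, carriers[0]))
    return result

calls = {
    ("contig1", 100): {"A": "1", "B": "0", "C": "0"},  # private to strain A
    ("contig1", 250): {"A": "1", "B": "1", "C": "1"},  # shared by all strains
    ("contig2", 40): {"A": "0", "B": ".", "C": "1"},   # strain B not called
}
unique = private_snps(calls)
```

Sites where every strain carries the alternate allele are dropped here as well; those more likely reflect a difference from (or an error in) the reference than a strain-specific variant.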


So let me make it clear:

1. You align your sample (let's say strain A) to the reference (strain B???)
2. You want to find the "TRUE SNPs", i.e. SNPs that differ between strain A and strain B?

Yes Sam,

I just want to give more details so you can understand exactly what I am asking.

My organism is haploid and clonal. There is also no chromosomal annotation available for my organism. I expect to find very little variation, as there is no sexual reproduction. So far I have followed the GATK best practices for variant calling in a haploid organism, followed by hard filtering. I ended up with 1.7k variants marked PASS. I want to filter further, so what are the ways I could do that?

I have already tried filtering for homozygous SNPs. It is not working. Any reasons?


If you have no chromosomal annotation, how did you do the alignment? Do you have some kind of "reference"? If not, how can you even define SNPs?


Sam,

I have a reference genome that was assembled as several supercontigs. So I concatenated all the supercontigs and mapped the raw reads of my strains against the concatenated reference. Does this make sense?


If you want high-confidence "SNPs", then you can try the following:

1. Remove any reads with more than one mapping.
2. Only call homozygous alternative "SNPs"; I don't think you will have enough power to find heterozygous ones without additional help.
3. Check whether there are multiple SNPs close to each other; that might be an indel instead. As you don't have the reference, I am not sure how well GATK's indel realignment performed. This works as an additional level of safety.
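
Point 3 can be checked with a small script. The sketch below flags SNPs that sit within a fixed distance of another SNP on the same contig. The 10 bp window and the name `flag_clustered` are assumptions for illustration, not recommended values.

```python
def flag_clustered(snps, window=10):
    """Return the SNPs lying within `window` bp of another SNP.

    `snps` is a list of (contig, position) tuples sorted by contig
    and then by position.
    """
    flagged = set()
    # Compare each SNP with its immediate neighbour in sorted order.
    for prev, curr in zip(snps, snps[1:]):
        same_contig = prev[0] == curr[0]
        if same_contig and curr[1] - prev[1] <= window:
            flagged.add(prev)
            flagged.add(curr)
    return flagged

snps = [("c1", 100), ("c1", 104), ("c1", 500), ("c2", 502), ("c2", 900)]
clustered = flag_clustered(sorted(snps))
```

Flagged sites are worth inspecting by eye in the alignment before discarding them, since a real indel or a misaligned region can both produce this pattern.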

Hi Sam,

I have a reference, but I don't have its chromosome annotation. I have already removed indels and screened for homozygous SNPs. After the indel removal step I ended up with 1.9k homozygous SNPs. But now I have to filter further, as I strongly feel that most of the SNPs are false positives, and I expect only a handful of SNPs because my organism is clonal and haploid. What should be the way ahead for me?


Have you filtered by coverage? A coverage of 8 or more reads is generally OK. Have you removed regions that look strange? For example, reads with more than one SNP call might be problematic. For homozygous SNPs, given your current situation, a good idea might be to remove SNPs that have any reference read count (e.g. Alt/Ref = 10/1 will still be called homozygous alt, but you might remove it simply because you have too many SNPs).
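
A minimal sketch of the depth and reference-count filter, assuming the VCF sample column carries the usual `AD` (allelic depths, reference first) sub-field. The function name and the default cutoffs (at least 8 reads, zero reference reads) are illustrative choices only.

```python
def passes_depth_and_ref(format_field, sample_field, min_depth=8, max_ref_reads=0):
    """Check one sample column of a VCF record.

    `format_field` is the FORMAT column (e.g. "GT:AD:DP") and
    `sample_field` is the matching sample column (e.g. "1:0,25:25").
    """
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad = values.get("AD", "")
    if not ad or "." in ad.split(","):
        return False  # no usable allelic depths for this sample
    counts = [int(x) for x in ad.split(",")]
    ref_reads, alt_reads = counts[0], sum(counts[1:])
    deep_enough = ref_reads + alt_reads >= min_depth
    return deep_enough and ref_reads <= max_ref_reads

keep_clean = passes_depth_and_ref("GT:AD:DP", "1:0,25:25")
keep_mixed = passes_depth_and_ref("GT:AD:DP", "1:1,10:11")
keep_shallow = passes_depth_and_ref("GT:AD:DP", "1:0,5:5")
```

Raising `max_ref_reads` to 1 or 2 would tolerate the odd sequencing error at the cost of letting more borderline sites through.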

Multiple alignments might also be something for you to consider. Removing reads that align to multiple regions might help you increase the specificity.

The problem with all this filtering is that it only increases the specificity of your detection and reduces your sensitivity. Most of the time we do expect a large number of SNPs to be observed in human; however, as we have no experience with your organism, we cannot be sure about the procedure to use.


Hi Sam,

I have filtered for homozygous non-reference SNPs. Next I am looking to filter further, keeping only sites with 0% of reads supporting the reference and 100% of reads supporting the sample allele. On what basis (e.g. depth, quality score, etc.) should I filter out more SNPs to get the true variants?


According to GATK, a SNP fails hard filtering if any of the following holds:

QualByDepth (QD) < 2.0
FisherStrand (FS) > 60.0
RMSMappingQuality (MQ) < 40.0
MappingQualityRankSumTest (MQRankSum) < -12.5

A long time ago, we also used a minimum depth of 8. So maybe you can try that.
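
For reference, here is a minimal Python sketch that applies the four thresholds above to the INFO column of a VCF record. The directions are assumed to follow GATK's usual hard-filtering guidance (fail when QD < 2.0, FS > 60.0, MQ < 40.0 or MQRankSum < -12.5). The helper names are made up for this example, and a missing annotation is treated as "not failed", which is a design choice rather than a rule.

```python
# A site fails a filter when the corresponding test returns True.
THRESHOLDS = {
    "QD": lambda v: v < 2.0,
    "FS": lambda v: v > 60.0,
    "MQ": lambda v: v < 40.0,
    "MQRankSum": lambda v: v < -12.5,
}

def parse_info(info):
    """Turn 'QD=1.5;FS=3.2' into {'QD': '1.5', 'FS': '3.2'}; bare flags are skipped."""
    return dict(item.split("=", 1) for item in info.split(";") if "=" in item)

def failed_filters(info):
    """Return the names of the thresholds that this INFO string fails."""
    annotations = parse_info(info)
    failed = []
    for name, is_bad in THRESHOLDS.items():
        raw = annotations.get(name)
        if raw is None:
            continue  # annotation absent, e.g. MQRankSum with no ref reads
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric value, leave it unjudged
        if is_bad(value):
            failed.append(name)
    return failed

bad = failed_filters("DP=30;QD=1.5;FS=75.0;MQ=60.0")
good = failed_filters("QD=25.0;FS=0.0;MQ=60.0;MQRankSum=-0.3")
```

MQRankSum compares reference-supporting and alternate-supporting reads, so it is often missing at sites with no reference reads; that is why missing values are skipped here instead of counted as failures.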


Hi Sam,

I have already used these parameters for hard filtering. I want to filter further.


Could you provide me your email, so that I could send you what I have done till now?


Hi Sam,

What should be the ideal/average depth of a true variant?

My hard-filtered variant file contains depths from 9 to 2030. Does a very high depth mean that there is an artifact? In that case, what should be done to find true variants?
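
One heuristic that is sometimes used for the depth question (an assumption here, not something established in this thread) is to flag sites whose depth is far above the typical depth, since reads from repeated or collapsed regions can pile up there. The sketch below uses a multiple of the median; the factor of 2 and the function name are arbitrary illustrative choices.

```python
from statistics import median

def flag_high_depth(depths, factor=2.0):
    """Return sites whose depth exceeds `factor` times the median depth.

    `depths` maps (contig, position) -> read depth at that site.
    """
    cutoff = factor * median(depths.values())
    return {site for site, dp in depths.items() if dp > cutoff}

depths = {
    ("c1", 10): 9,
    ("c1", 80): 35,
    ("c1", 300): 40,
    ("c2", 15): 42,
    ("c2", 77): 2030,
}
suspicious = flag_high_depth(depths)  # median is 40, so the cutoff is 80
```

A flagged site is not necessarily false; it is only a candidate for closer inspection or lower priority.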


That all depends on the meaning of "true". The problem in your case is that you really don't have much information to assist your filtering. Even in normal cases where we were dealing with the human genome, with all the additional information, we usually still had a large number of variants left after filtering. The issue is that the genome inherently mutates; that is, mutations can occur during replication.

I am not sure which organism you are working with, and most likely I have no experience with it. However, I imagine that even if it is clonal, it will still acquire mutations during replication; it is only that the mutation rate might differ from that in human. Bioinformatics can only take you so far, and if you really want to find the "true" SNPs, you have to rely on Sanger sequencing. In fact, even in human, we can only find SNPs that we are confident in, but we can never say that one is "true" until we validate it by Sanger sequencing. All I can suggest now is to prioritize the SNPs, see whether any of them look interesting for your research question, and then validate those using Sanger sequencing. Other than that, there is no valid and correct way to find all the "true" SNPs from the sequencing data. To be honest, in my view, that is impossible with bioinformatics alone at the current technological level.


Hi Sam,

Thank you very much for your detailed input on my question. I think I have got what I wanted, and now I want to move ahead with the analysis. I would like to do a maximum parsimony analysis on my data. Which is the best tool for it, and can I use a VCF file for the downstream analysis?
