Question: How do I get the unique SNPs for each strain?
gravatar for antonioggsousa
9 months ago by
antonioggsousa910 wrote:


First, I would like to say that I'm new to genome and variant calling data analyses and, of course, to GATK. I read some tutorials and the BestPractices guide before I start. I'm using gatk version

So, let's go to the point. I have a couple of genomes for different yeast strains. I'm interest into SNPs that are unique to each of these strains. So, I started by aligning these against the reference genome available with bwa-mem, convert the SAM to BAM, and sorted the BAM files, for each of the genomes that I have. Then, I used PICARD to fix the read group header and finally I used GATK. Regarding GATK, the commands that I used were the following:

Call variants for each genome alignment file with 'HaplotypeCaller' (´$´ stands for bash variables; "${bam}" are just the prefix of sample names over which I loop):

$GATK HaplotypeCaller -R $DB -I ${IN}/${bam}.sort.rg.bam -O ${OUT}/${bam}.g.vcf -ERC GVCF

Then I realize that in the new version of GATK you need to combine first your multiple '.vcf' files before calling GenotypeGVCFs. In this command I used the 'g.vcf' files generated before to combine the variants of the couple of genomes that I have :

$GATK CombineGVCFs -R $DB --variant ${OUT}/${1}.g.vcf --variant ${OUT}/${2}.g.vcf --variant ${OUT}/${3}.g.vcf --variant ${OUT}/${4}.g.vcf -O ${OUT}/multiple_yeast_strain.g.vcf

Finally I call the variants for the combined gvcf file with 'GenotypeGVCFs' command:

$GATK GenotypeGVCFs -R $DB -V ${OUT}/multiple_yeast_strain.g.vcf -O ${OUT}/genotype_merged.vcf

So far, so good (I think!). Then to get the unique variants, particularly the SNPs, for each strain, I mean the SNPs that appeared in strain A, but does not appear in the remaining strains, the SNPs that appear in strain B, but does not appear in strain A neither in any other strain, and so. For this I thought to use the 'SelectVariants' command in GATK with the option '--discordance', like this (for one of the strains):

I call the variant obtained with 'HaplotypeCaller' for the strain that I'm interested in, lets say strainA.g.vcf, with the '-V' option and I want the SNPs that appear in strainA.g.vcf but not in the joint GenotypeGVCFs command, so I give the file generated previously to '--discordance' and I select only the type of variants that I want: "-select-type SNP"

$GATK SelectVariants -R $DB --discordance ${OUT}/genotype_merged.vcf -V ${OUT}/strainA.g.vcf -select-type SNP -O strainA.unique.g.vcf

Then, when I run the following command to get the no. of SNPs unique to strainA, I got "0 SNPs"

$GATK CountVariants -V strainA.unique.g.vcf

When I wrote this I realize that actually it may not make too much sense to get the unique variants for strainA from a joint gvcf file where the strainA it's there. So, I tried to use the "--concordance" option, just to look if worked and I still got "0 SNPs". So, I'm struggling with what I'm doing wrong and a possible solution to get the unique SNPs for each strain. I hope that you can help me on this.

Thanks in advance,


snp • 339 views
ADD COMMENTlink modified 9 months ago by Pierre Lindenbaum129k • written 9 months ago by antonioggsousa910
gravatar for Pierre Lindenbaum
9 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

you must work on ${OUT}/genotype_merged.vcf which contains your two strains. Then use gatk selectVariant with a JEXL expression (not tested)

(vc.getGenotype("strain1").isHomRef()  && !vc.getGenotype("strain2").isHomRef()) || (!vc.getGenotype("strain1").isHomRef()  && vc.getGenotype("strain2").isHomRef())
ADD COMMENTlink written 9 months ago by Pierre Lindenbaum129k

Thank you Pierre Lindenbaum!

I tested and it seem to work. António

ADD REPLYlink written 9 months ago by antonioggsousa910
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1664 users visited in the last hour