Hi,
First, I would like to say that I'm new to genome and variant calling data analyses and, of course, to GATK. I read some tutorials and the BestPractices guide before I start. I'm using gatk version 4.1.4.0.
So, let's go to the point. I have a couple of genomes for different yeast strains. I'm interest into SNPs that are unique to each of these strains. So, I started by aligning these against the reference genome available with bwa-mem, convert the SAM to BAM, and sorted the BAM files, for each of the genomes that I have. Then, I used PICARD to fix the read group header and finally I used GATK. Regarding GATK, the commands that I used were the following:
Call variants for each genome alignment file with 'HaplotypeCaller' (´$´ stands for bash variables; "${bam}" are just the prefix of sample names over which I loop):
$GATK HaplotypeCaller -R $DB -I ${IN}/${bam}.sort.rg.bam -O ${OUT}/${bam}.g.vcf -ERC GVCF
Then I realize that in the new version of GATK you need to combine first your multiple '.vcf' files before calling GenotypeGVCFs. In this command I used the 'g.vcf' files generated before to combine the variants of the couple of genomes that I have :
$GATK CombineGVCFs -R $DB --variant ${OUT}/${1}.g.vcf --variant ${OUT}/${2}.g.vcf --variant ${OUT}/${3}.g.vcf --variant ${OUT}/${4}.g.vcf -O ${OUT}/multiple_yeast_strain.g.vcf
Finally I call the variants for the combined gvcf file with 'GenotypeGVCFs' command:
$GATK GenotypeGVCFs -R $DB -V ${OUT}/multiple_yeast_strain.g.vcf -O ${OUT}/genotype_merged.vcf
So far, so good (I think!). Then to get the unique variants, particularly the SNPs, for each strain, I mean the SNPs that appeared in strain A, but does not appear in the remaining strains, the SNPs that appear in strain B, but does not appear in strain A neither in any other strain, and so. For this I thought to use the 'SelectVariants' command in GATK with the option '--discordance', like this (for one of the strains):
I call the variant obtained with 'HaplotypeCaller' for the strain that I'm interested in, lets say strainA.g.vcf, with the '-V' option and I want the SNPs that appear in strainA.g.vcf but not in the joint GenotypeGVCFs command, so I give the file generated previously to '--discordance' and I select only the type of variants that I want: "-select-type SNP"
$GATK SelectVariants -R $DB --discordance ${OUT}/genotype_merged.vcf -V ${OUT}/strainA.g.vcf -select-type SNP -O strainA.unique.g.vcf
Then, when I run the following command to get the no. of SNPs unique to strainA, I got "0 SNPs"
$GATK CountVariants -V strainA.unique.g.vcf
When I wrote this I realize that actually it may not make too much sense to get the unique variants for strainA from a joint gvcf file where the strainA it's there. So, I tried to use the "--concordance" option, just to look if worked and I still got "0 SNPs". So, I'm struggling with what I'm doing wrong and a possible solution to get the unique SNPs for each strain. I hope that you can help me on this.
Thanks in advance,
António
Thank you Pierre Lindenbaum!
I tested and it seem to work. António