How do I get the unique SNPs for each strain?
1
0
Entering edit mode
4.5 years ago

Hi,

First, I would like to say that I'm new to genome and variant calling data analyses and, of course, to GATK. I read some tutorials and the BestPractices guide before I start. I'm using gatk version 4.1.4.0.

So, let's go to the point. I have a couple of genomes for different yeast strains. I'm interest into SNPs that are unique to each of these strains. So, I started by aligning these against the reference genome available with bwa-mem, convert the SAM to BAM, and sorted the BAM files, for each of the genomes that I have. Then, I used PICARD to fix the read group header and finally I used GATK. Regarding GATK, the commands that I used were the following:

Call variants for each genome alignment file with 'HaplotypeCaller' (´$´ stands for bash variables; "${bam}" are just the prefix of sample names over which I loop):

$GATK HaplotypeCaller -R $DB -I ${IN}/${bam}.sort.rg.bam -O ${OUT}/${bam}.g.vcf -ERC GVCF

Then I realize that in the new version of GATK you need to combine first your multiple '.vcf' files before calling GenotypeGVCFs. In this command I used the 'g.vcf' files generated before to combine the variants of the couple of genomes that I have :

$GATK CombineGVCFs -R $DB --variant ${OUT}/${1}.g.vcf --variant ${OUT}/${2}.g.vcf --variant ${OUT}/${3}.g.vcf --variant ${OUT}/${4}.g.vcf -O ${OUT}/multiple_yeast_strain.g.vcf

Finally I call the variants for the combined gvcf file with 'GenotypeGVCFs' command:

$GATK GenotypeGVCFs -R $DB -V ${OUT}/multiple_yeast_strain.g.vcf -O ${OUT}/genotype_merged.vcf

So far, so good (I think!). Then to get the unique variants, particularly the SNPs, for each strain, I mean the SNPs that appeared in strain A, but does not appear in the remaining strains, the SNPs that appear in strain B, but does not appear in strain A neither in any other strain, and so. For this I thought to use the 'SelectVariants' command in GATK with the option '--discordance', like this (for one of the strains):

I call the variant obtained with 'HaplotypeCaller' for the strain that I'm interested in, lets say strainA.g.vcf, with the '-V' option and I want the SNPs that appear in strainA.g.vcf but not in the joint GenotypeGVCFs command, so I give the file generated previously to '--discordance' and I select only the type of variants that I want: "-select-type SNP"

$GATK SelectVariants -R $DB --discordance ${OUT}/genotype_merged.vcf -V ${OUT}/strainA.g.vcf -select-type SNP -O strainA.unique.g.vcf

Then, when I run the following command to get the no. of SNPs unique to strainA, I got "0 SNPs"

$GATK CountVariants -V strainA.unique.g.vcf

When I wrote this I realize that actually it may not make too much sense to get the unique variants for strainA from a joint gvcf file where the strainA it's there. So, I tried to use the "--concordance" option, just to look if worked and I still got "0 SNPs". So, I'm struggling with what I'm doing wrong and a possible solution to get the unique SNPs for each strain. I hope that you can help me on this.

Thanks in advance,

António

snp • 1.3k views
ADD COMMENT
2
Entering edit mode
4.5 years ago

you must work on ${OUT}/genotype_merged.vcf which contains your two strains. Then use gatk selectVariant with a JEXL expression (not tested)

(vc.getGenotype("strain1").isHomRef()  && !vc.getGenotype("strain2").isHomRef()) || (!vc.getGenotype("strain1").isHomRef()  && vc.getGenotype("strain2").isHomRef())
ADD COMMENT
1
Entering edit mode

Thank you Pierre Lindenbaum!

I tested and it seem to work. António

ADD REPLY

Login before adding your answer.

Traffic: 1723 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6