ddRADseq: GATK not matching STACKS refmap in terms of resolution?
0
0
Entering edit mode
14 months ago
Valentin • 0

Dear Biostars community,

First of all, this is my first question here (and in any scientific online community, for that matter), so please excuse if I miss any rules/guidelines. Please let me know in case, I am happy to learn!

A little bit of background:

I am working on ddRADseq dataset of >1200 samples of various species of non-model plants (all the same genus). My samples comprise diploids, tetraploids and hexaploids and the goal is to conduct phylogentic and population genetic analyses with these data.

Unfortunately, I am not a bioinformatician, so all of this is quite new but exciting to me. In our research group, people have exclusively worked with STACKS in the past, mostly due to the lack of reference genome for many of our study species.

However, since I am working on polyploids, STACKS is not an option for me. Luckily, I do have a published reference genome available (not exactly from any of my species, but from the same genus and nicely assembled to chromosome level).

My problem:

Therefore, my plan was to use GATK for variant calling. In pricipal, this worked fine for me, my only issue is the following:

For a subset of my samples (exclusively diploids), I also run STACKS refmap.pl (on the same .bam files as GATK) just for comparison, but ended up with much better resolution than GATK provided for the same subset of diploid samples. With better resolution I mean mostly clearer separation of species in a Neighbor-Net based on Nei's distances and PCAs and more consistent grouping of individuals from the same population (not consistently the case in the GATK output). Overall, my species are not strongly divergent, which is probably the result of both biological reasons (most likely young species and frequent hybridization) and the limitations of RADseq, but still, I get quite a better separation when using STACKS.

My workflow:

To describe my workflow a bit, I performed the following:

for GATK:

  • alignment of demultiplexed reads to the indexed reference genome using bwa with default settings
  • adding read groups, sorting and indexing of .bam files using picard
  • per-sample Haplotype calling with GATK HaplotypeCaller in GVCF mode, specifying the ploidy of each sample. Otherwise default settings.
  • Combining of GVCFs using GenomicsDatabaseImport using an intervals list of 24 evenly-sized regions (6 chromosomes à 4 regions)
  • Combined genotyping using GenotypeGVCFs, the same Intervals list and the "--all-sites" flag. Otherwise default settings.
  • This gave me 24 raw vcf files which I filtered largely following the Broad's recommendations for hard filtering:
##### 
keep only biallelic SNPs:
##### 
gatk --java-options "-Xmx4g" SelectVariants \
-V ${raw_dir}/${vcf_name}.vcf.gz \
-O ${filtered_dir}/${vcf_name}_SNP2.vcf.gz \
--select-type-to-include SNP \
--restrict-alleles-to BIALLELIC 

#####
hard filtering of SNPs:
##### 
gatk --java-options "-Xmx4g" VariantFiltration \
-R ${reference} \
-V ${filtered_dir}/${vcf_name}_SNP2.vcf.gz \
-O ${filtered_dir}/${vcf_name}_QUAL_QD_FS_SOR_MQ_MQRS_RPRS_SNP2.vcf.gz \
-filter 'QUAL < 40' --filter-name 'QUAL40' \
-filter 'QD < 2.0' --filter-name 'QD2' \
-filter 'FS > 60.0' --filter-name 'FS60' \
-filter 'SOR > 3.0' --filter-name 'SOR3' \
-filter 'MQ < 40.0' --filter-name 'MQ40' \
-filter 'MQRankSum < -12.5' --filter-name 'MQRankSum-12.5' \
-filter 'ReadPosRankSum < -15.0' --filter-name 'ReadPosRankSum-15'

I then used these pre-filtered vcfs to filter individual genotypes for read depth (DP). For the full dataset, I did this specifically for each ploidy and combined the vcfs afterwards, but since the subset I compared with STACKS contains only diploids, I kept the same cutoffs for all samples.

Initially, I filtered for minimum DP of 8 and maximum DP of 200, but I varied these settings quite a bit and also used stricter thresholds, but this did not change the output fundamentally.

After filtering, I set all variants which failed the filtering process to no-call and kept only sites present in at least 50% of individuals.

for STACKS:

For STACKS, I used a much simpler workflow based on exactly the same .bam files as for GATK.

  • SNP calling using STACKS refmap.pl with default settings
  • filtering with "Populations" using a minimum minor-allele frequency of 3 individuals, maximum heterozygosity of 0.65 and discarding loci present in less than 50% of samples. I didn't even bother with a depth filter in this case.

While GATK yielded more SNPs, resolution was higher in the STACKS ouput, despite the rather rudimentary filtering. One might argue that this could also be some spurious clustering/artifacts, but since the result makes sense biologically (better grouping of species/populations), I kind of doubt it.

I also tried a MAF filter for the GATK vcf, and while it slightly improved the result, it still did not come close to the STACKS ouput. At some point, I thought that also the reference genome (or rather its divergence from my species) might be problematic, but about 80% of reads were successfully mapped (albeit only c. 60% properly paired, if I remember correctly). But since I used the same .bam files for STACKS I'm not so sure that this might be the reason of my problems, either.

Anyway, I am now somewhat stuck and ran out of ideas. I cannot believe that GATK does not match STACKS in this regard and surely I am missing something obvious, but I simply can't figure it out. By now, I tried so many different filtering settings, but the results never changed dramatically. Still, I believe it must be possible to get the same/very close results from both variant callers.

If you have any ideas how I could improve filtering of the GATK ouput to match STACKS or if you spot any flaws in my approach, I would be very grateful!

And please ask, in case any further details are required.

Thank you very much in advance!

Best regards,
Valentin

polyploidy RADseq variant-calling STACKS GATK • 493 views
ADD COMMENT

Login before adding your answer.

Traffic: 3572 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6