I have run a GATK analysis on 340 plant individuals with ploidy levels ranging from 2x to 6x, distributed across 18 genetic groups, in order to obtain their genotypes (SNPs and indels) along a gene. I ran HaplotypeCaller in ERC GVCF mode and obtained gVCFs. I have several questions about improving my workflow.
- Since I have many individuals, I thought of building a multi-sample input (multi-BAM) before running HC, but the problem is that the ploidy level differs between individuals. Is there a parameter or option that lets HC handle a ploidy level recorded in the BAM header? Second possibility: I could merge the BAM files that share the same ploidy level, but I am worried that this might bias the variant calling according to ploidy level. Third possibility: I could run HaplotypeCaller individual by individual, setting the ploidy level each time, and then use VCFtools to compare the outputs. Which of these would you recommend?
- I do not understand the exact meaning of ‘pools’ and ‘cohort’. In my case, can we say that the cohort is the genetic group and the pools are the subgroups of individuals grouped by ploidy level? Is there a risk of a biased analysis, given that my genetic groups differ in size?
- Is it normal to get an empty file when I run RealignerTargetCreator with the following command? java -jar ../GenomeAnalysisTK.jar -T RealignerTargetCreator -R FTM.fasta -I B00H39U.bam -o forIndelRealigner1.intervals
- Finally, what exactly does <NON_REF> mean in the ALT column of the GVCF output? How is it taken into account when HC determines the genotype?
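
To make the third possibility from my first question concrete, here is a minimal sketch of what I have in mind: one HaplotypeCaller run per individual, with the ploidy set each time. I am assuming the GATK 3.x options `-ploidy` and `--emitRefConfidence GVCF`; the sample names and ploidy values are hypothetical placeholders, and the helper `haplotypecaller_command` exists only for this illustration. The script only prints the command lines, it does not run them.

```python
# Hypothetical sample-to-ploidy table; the real names and values would
# come from my own sample sheet.
PLOIDY_BY_SAMPLE = {"sampleA": 2, "sampleB": 4, "sampleC": 6}


def haplotypecaller_command(sample, ploidy,
                            reference="FTM.fasta",
                            jar="../GenomeAnalysisTK.jar"):
    """Build the argument list for one per-sample HaplotypeCaller run.

    Assumes GATK 3.x argument names and that each BAM file is named
    <sample>.bam (an assumption made for this sketch).
    """
    return [
        "java", "-jar", jar,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", f"{sample}.bam",
        "--emitRefConfidence", "GVCF",  # ERC GVCF mode, as in my current runs
        "-ploidy", str(ploidy),         # ploidy of this individual
        "-o", f"{sample}.g.vcf",
    ]


# Print one command line per individual, in a stable order.
for sample, ploidy in sorted(PLOIDY_BY_SAMPLE.items()):
    print(" ".join(haplotypecaller_command(sample, ploidy)))
```

Each individual would then have its own gVCF called at its own ploidy, which is the situation I am asking about above.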
Thanks for your help