I am entirely new to WGS. So far I have worked only with SNP-array data (PLINK format, .vcf, .haps/.sample), and I now need to check the allele frequency of a single SNP in WGS data. That means I have to perform SNP calling, right?
The thing is, I downloaded the data in .bam format from ENA. However, I do not know which reference genome the reads were aligned to, and I need the matching FASTA reference to call SNPs with GATK's HaplotypeCaller. Does anyone know how I can find out the reference genome (build version) so that I can download the right FASTA file?
I saw a very similar question in this post, where @matted wrote:
In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.
However, as I said, I am a newbie. How can I infer the reference from the chromosome names and the assembly version from the sizes? My purpose is really straightforward: I just want to see the allele frequency of one SNP in these data.
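If I understand the suggestion correctly, it would amount to something like the sketch below, but I am not sure this is the right approach. It assumes the BAM header has been dumped to text first (e.g. with `samtools view -H file.bam`), reads the `@SQ` lines (sequence name `SN` and length `LN`), and compares the chromosome 1 length against a small lookup table. The function names and the table are my own, the table covers only three human builds, and a match on chromosome 1 alone is only a hint, not proof.

```python
# Chromosome 1 lengths (bp) for a few human assemblies.
# Illustrative subset only; other builds and species are not covered.
CHR1_LENGTHS = {
    247249719: "NCBI36 / hg18",
    249250621: "GRCh37 / hg19",
    248956422: "GRCh38 / hg38",
}


def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def guess_build(header_text):
    """Guess (build, naming style) from the chromosome 1 entry, if present."""
    lengths = parse_sq_lines(header_text)
    for name in ("chr1", "1"):
        if name in lengths:
            build = CHR1_LENGTHS.get(lengths[name], "unknown")
            style = "chr prefix (UCSC style)" if name == "chr1" else "no chr prefix"
            return build, style
    return "unknown", "no chromosome 1 found"


# Made-up two-line header for illustration; a real one has many more @SQ lines.
example_header = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)
print(guess_build(example_header))
```

The naming style matters as well as the build: a FASTA whose sequences are named `chr1` will not work with a BAM whose sequences are named `1`, even if both come from the same assembly.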
Any help would be much appreciated!