I have a large dataset of whole genome sequencing (WGS) data. In a recent large GWAS, I found a number of promising significant hits. I would like to check whether these SNPs are associated with the specific phenotypic trait I'm interested in in the WGS data that I have. The data was genotyped by Illumina, and I have the .bam and .vcf files that they provided. What is the general workflow for this type of analysis? Because I have the LD blocks of these SNPs, my thought was to first extract those regions from the WGS data using SAMtools (or R). Do I then need to convert the extracted regions into VCF files and run an association analysis against my phenotype of interest? Thanks in advance for your help.
I would run association tests on the whole-genome VCFs (both common- and rare-variant tests, using something like plinkseq). From there, I would check whether any hits fall in the regions of interest from your chip-based study.
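To make the idea of a single-variant (common variant) test concrete, here is a minimal sketch of an allelic chi-square test on a 2x2 table of allele counts. It is only an illustration, not what plinkseq does internally: it assumes a binary case/control phenotype, and the function name `allelic_chi2` and its arguments are made up for this example. A real analysis would also need covariates, population-structure correction, and a multiple-testing correction.

```python
import math


def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (chi2, p) for a 2x2 allele-count table (1 degree of freedom).

    Hypothetical helper: counts are numbers of alternate/reference alleles
    observed in cases and in controls for one variant.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    denom = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
             * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    if denom == 0:
        # A row or column is empty (e.g. monomorphic site): no test possible.
        return 0.0, 1.0
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    chi2, p = allelic_chi2(case_alt=30, case_ref=70, ctrl_alt=10, ctrl_ref=90)
    print(f"chi2={chi2:.2f} p={p:.2e}")
```

For a quantitative trait you would use a regression-based test instead, and rare variants are usually grouped per gene or region (burden-style tests) rather than tested one at a time.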
Doing it the way you described has advantages (shorter run time, less storage needed, a smaller correction for multiple testing), but what a shame it would be to ignore so much data! If you decide to go this route, you can just filter your existing VCFs for your regions of interest. There is no need to go back to the BAMs and convert them into VCFs unless you want to do your own variant calling.
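If you do go the region-filtering route, the filtering step could look roughly like the sketch below. It uses plain Python with no VCF library; the function names and the `chrom:start-end` region format (1-based, inclusive) are assumptions made for this example. It matches on the POS column only, so an indel that starts just before a region but overlaps it would be missed, and the chromosome names must match the VCF exactly (`chr1` vs `1`). For large files a dedicated indexed tool will be much faster.

```python
import gzip


def parse_region(text):
    """Parse 'chrom:start-end' (1-based, inclusive) into (chrom, start, end)."""
    chrom, span = text.rsplit(":", 1)
    start, end = span.replace(",", "").split("-")
    return chrom, int(start), int(end)


def filter_vcf_lines(lines, regions):
    """Yield header lines plus records whose POS lies inside any region."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.split("\t", 2)
        chrom, pos = fields[0], int(fields[1])
        if any(c == chrom and s <= pos <= e for c, s, e in regions):
            yield line


def filter_vcf_file(in_path, out_path, region_strings):
    """Write an uncompressed VCF restricted to the given regions."""
    regions = [parse_region(r) for r in region_strings]
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as src, open(out_path, "w") as dst:
        dst.writelines(filter_vcf_lines(src, regions))
```

Usage would be something like `filter_vcf_file("cohort.vcf.gz", "ld_blocks.vcf", ["chr1:1000-2000"])`, where the file names and coordinates are placeholders for your own data and LD block boundaries.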