Hi all, I am new to bioinformatics and would appreciate some help with the following task. I am looking to extract the allele counts for each subpopulation in the gnomAD data (https://gnomad.broadinstitute.org/). That is, I would like to produce a table which reports, for every subpopulation (African, Latino, Ashkenazi Jewish, East Asian, etc.) and for every position in the genome/exome, (1) the number of times that position has been successfully sequenced and (2) the number of times a variant relative to the reference has been observed there.
So for example, for each population 'pop' and each site 'site', I would like to have a csv table which tells me:
- table['pop']['site'][AN] = x (total number of times site 'site' has been successfully sequenced in population 'pop')
- table['pop']['site'][AF] = y (total number of times site 'site' has been successfully sequenced in population 'pop' and has shown a given variant with respect to the reference)
For consistency, y <= x.
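To make this concrete, here is a rough Python sketch of the table I have in mind. It is only a sketch under assumptions: I am assuming the per-population counts sit in the VCF INFO column under keys named like AC_afr and AN_afr (alternate allele count and total allele number), which I have not verified for every gnomAD release, and the function names (parse_info, population_counts, write_table) are just made up for illustration.

```python
import csv
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def parse_info(info):
    """Turn 'AC=3;AN=20;FLAG' into {'AC': '3', 'AN': '20'} (flags are skipped)."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            fields[key] = value
    return fields


def population_counts(lines, pops):
    """Yield (pop, chrom, pos, ref, alt, AC, AN) for each population and ALT allele.

    ASSUMPTION: counts are stored as INFO keys 'AC_<pop>' and 'AN_<pop>';
    populations whose keys are missing on a line are silently skipped.
    """
    for line in lines:
        if line.startswith("#"):  # header lines
            continue
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = cols[0], int(cols[1]), cols[3], cols[4].split(",")
        info = parse_info(cols[7])
        for pop in pops:
            ac = info.get(f"AC_{pop}")
            an = info.get(f"AN_{pop}")
            if ac is None or an is None:
                continue
            # AC has one value per ALT allele; AN is a single number per site.
            for alt, ac_value in zip(alts, ac.split(",")):
                yield (pop, chrom, pos, ref, alt, int(ac_value), int(an))


def write_table(vcf_path, out_path, pops):
    """Write the per-population counts to a csv file."""
    with open_vcf(vcf_path) as vcf, open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["pop", "chrom", "pos", "ref", "alt", "AC", "AN"])
        writer.writerows(population_counts(vcf, pops))
```

With this, something like write_table("my_gnomad_file.vcf.gz", "counts.csv", ["afr", "eas"]) would be the intended usage (file names and population codes are placeholders). As far as I understand, a sites VCF only contains rows for positions where a variant was observed, so this would not give AN for positions without any variant.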
A collaborator who no longer works on this project gave me the type of file I need for the ExAC dataset, but it would be great to upgrade to gnomAD.
I have the VCF file (downloaded from the gnomAD website) and assume the operation I'm describing should be straightforward to implement using vcftools, but I am having trouble extracting this. Any help is appreciated!
thanks :)
P.S. An image of the ideal outcome for gnomAD is attached: