Extracting summary statistics from gnomAD VCF file
2.5 years ago
lom • 0

Hi all, I am new to bioinformatics and would appreciate some help with the following task. I am looking to extract the allele counts for each subpopulation in the gnomAD data (https://gnomad.broadinstitute.org/). That is, I would like to produce a table which reports, for every subpopulation (African, Latino, Ashkenazi Jewish, East Asian, etc.) and for every position in the genome/exome, (1) the number of times that position has been successfully sequenced and (2) the number of times a variant differing from the reference has been observed there.

So, for example, for each population 'pop' and each site 'site', I would like to have a csv table which tells me:
- table['pop']['site'][AN] = x (total number of times site 'site' has been successfully sequenced in population 'pop')
- table['pop']['site'][AF] = y (total number of times site 'site' has been successfully sequenced in population 'pop' and has shown some given variant with respect to the reference); for consistency, y <= x.

A collaborator who no longer works on this project gave me the type of file I need for the ExAC dataset, but it would be great to upgrade to gnomAD.

I have the VCF file (downloaded from the gnomAD website) and assume the operation I'm describing should be straightforward to implement using vcftools, but I am having trouble extracting this. Any help is appreciated!

thanks :)

p.s. - image attached of ideal outcome for gnomAD:

this refers to the ExAC data -- it's a .csv table

gnomAD allele frequency SFS vcf • 1.5k views
2.5 years ago

Use bcftools query with the correct fields: *_AC and *_AN ...
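If bcftools is not at hand, the same per-population extraction can be sketched in plain Python. This is only a minimal illustration, not a definitive implementation: it assumes the INFO keys are named AC_<pop> and AN_<pop> (for example AC_afr, AN_afr), which differs between gnomAD releases, so check the ##INFO lines in the header of your file first. The function names and the toy record below are made up for the example.

```python
import csv
import io


def parse_info(info):
    """Split a VCF INFO string into a dict; flag-only keys map to ''."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def population_counts(vcf_lines, pops):
    """Yield (chrom, pos, ref, alt, pop, AC, AN) for each record and population.

    ASSUMPTION: per-population counts are stored as AC_<pop> / AN_<pop>.
    Values are passed through as strings, exactly as they appear in the file.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = cols[0], cols[1], cols[3], cols[4]
        info = parse_info(cols[7])
        for pop in pops:
            ac = info.get(f"AC_{pop}")
            an = info.get(f"AN_{pop}")
            if ac is None or an is None:
                continue  # this population is not annotated on this record
            yield chrom, pos, ref, alt, pop, ac, an


def write_csv(rows, out):
    """Write the rows as a long-format csv table with a header line."""
    writer = csv.writer(out)
    writer.writerow(["chrom", "pos", "ref", "alt", "pop", "AC", "AN"])
    writer.writerows(rows)


# Toy record, invented for illustration only (not real gnomAD data).
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t12345\t.\tA\tG\t.\tPASS\tAC=3;AN=100;AC_afr=2;AN_afr=40;AC_eas=1;AN_eas=60\n",
]

buf = io.StringIO()
write_csv(population_counts(example, ["afr", "eas"]), buf)
print(buf.getvalue())
```

For a real bgzipped file, the lines can be read with gzip.open(path, "rt") and passed to population_counts in place of the toy list. The long format (one row per site and population) keeps AC <= AN easy to check per row and can be pivoted into a wide table afterwards.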

