Download variant CSVs for hundreds of genes from gnomAD
5.0 years ago
Qingyang Xiao ▴ 160

I have 500 genes of interest whose variants I want to download from gnomAD for SNP analysis.

It will take forever if I type each gene name and click the "Export to csv" button.

How can I do that in batches?

genome SNP • 4.6k views
5.0 years ago
wget -O - "https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz" | gunzip -c | grep -E '(^#|\|(GENE1|GENE2|GENE3|GENE4)\|)' > genes.vcf

If you are interested in specific genes, you would probably want to use gnomAD exomes, not genomes. It's based on more samples and the file is substantially smaller.


Small suggestion: if you have the disk space (on the order of ~1 TB), you could first save the wget output to a local file (e.g. wget -O - "https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz" > gnomad.vcf.bgz) and then query that file with gunzip + grep afterwards, in case you want to look at different genes or you notice a typo. You could also work per chromosome and grep only for the genes located on the chromosomes you need (see the download page).

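Since you have 500 genes, you could also put them in a text file (one gene per line) and pass that file to grep as the list of search patterns, changing the grep part to gunzip -c gnomad.vcf.bgz | grep -E -f mygenes.txt. Note that, unlike the original command, this drops the VCF header lines unless the pattern file also contains a ^# line.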

Also keep in mind that grep will match the text wherever it occurs: if a gene symbol is a substring of something unrelated, that line gets matched too, so you should definitely check your output for correct matches.
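One way to cut down on substring matches is to wrap every symbol in the | delimiters, as the regex in the answer above does. The following is only a sketch: it assumes the symbol appears as |SYMBOL| in the annotation of each record, and mygenes.txt, patterns.txt, mock.vcf and the records in it are made up for illustration.

```shell
# Hypothetical gene list, one symbol per line.
printf 'BRCA1\nTP53\n' > mygenes.txt

# Tiny mock VCF standing in for the decompressed gnomAD file (records are made up).
printf '##fileformat=VCFv4.2\n' > mock.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> mock.vcf
printf '17\t100\t.\tA\tG\t.\tPASS\tvep=G|missense_variant|MODERATE|BRCA1|ENSG0001\n' >> mock.vcf
printf '17\t200\t.\tC\tT\t.\tPASS\tvep=T|intron_variant|MODIFIER|BRCA1P1|ENSG0002\n' >> mock.vcf
printf '17\t300\t.\tG\tA\t.\tPASS\tvep=A|synonymous_variant|LOW|TP53|ENSG0003\n' >> mock.vcf

# Build the pattern file: '^#' keeps header lines, and each symbol becomes
# \|SYMBOL\| (a literal pipe on both sides). Blank lines are dropped first,
# because an empty symbol would turn into a pattern that matches '||'.
{
    printf '^#\n'
    sed -e '/^$/d' -e 's/.*/\\|&\\|/' mygenes.txt
} > patterns.txt

# Single pass over the data, patterns read from the file.
grep -E -f patterns.txt mock.vcf > genes.vcf
```

With the real data, mock.vcf would be replaced by the gunzip -c stream. In this mock, the BRCA1P1 record is not matched by the BRCA1 pattern, because the closing | is part of the pattern.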

Finally, do you have gene symbols or gene identifiers (e.g. Ensembl or RefSeq)? I would download the smallest file first (the chr21 sites VCF, 6.12 GiB) and check that your inputs work with what the gnomAD VCF provides, and then run on the whole dataset.
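A quick sanity check for such a test run could count hits per gene in the filtered output and list the inputs that never matched. This is a sketch under assumptions: chr21_hits.vcf stands for the (small) grep output of a chr21 test run, its records are made up, and NOTAGENE is a deliberately bogus symbol.

```shell
# Hypothetical input list; NOTAGENE is intentionally not a real symbol.
printf 'APP\nSOD1\nNOTAGENE\n' > mygenes.txt

# Mock filtered output from a chr21 test run (records are made up).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > chr21_hits.vcf
printf '21\t100\t.\tA\tG\t.\tPASS\tvep=G|missense_variant|MODERATE|APP|ENSG0001\n' >> chr21_hits.vcf
printf '21\t150\t.\tC\tT\t.\tPASS\tvep=T|intron_variant|MODIFIER|APP|ENSG0001\n' >> chr21_hits.vcf
printf '21\t900\t.\tG\tA\t.\tPASS\tvep=A|synonymous_variant|LOW|SOD1|ENSG0002\n' >> chr21_hits.vcf

# Count data lines containing |GENE| for every gene in the list.
while IFS= read -r gene; do
    [ -n "$gene" ] || continue
    # grep -c exits non-zero on zero matches, hence the '|| true'.
    n=$(grep -v '^#' chr21_hits.vcf | grep -c -F "|${gene}|" || true)
    printf '%s\t%s\n' "$gene" "$n"
done < mygenes.txt > gene_counts.tsv

# Genes without any hit: not on this chromosome, or not named as expected.
awk -F '\t' '$2 == 0 { print $1 }' gene_counts.tsv > missing_genes.txt
```

Anything listed in missing_genes.txt is either located on another chromosome or spelled differently from what the VCF uses, which is worth knowing before processing the full file.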


But VCF files don't have gene names/symbols, correct? Maybe you have to convert your gene list into start:end coordinates. I have a similar task, and I'd love help on the matter.
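The grep in the answer above relies on gene symbols appearing in the annotation of each record. If you would rather go by coordinates, a minimal sketch of a region filter follows. It assumes a BED-style file (chrom, 0-based start, end, name) that you prepared yourself; regions.bed, mock.vcf and all coordinates here are made up.

```shell
# Hypothetical gene regions in BED layout: chrom, start (0-based), end, name.
printf '17\t150\t350\tGENE_A\n' > regions.bed

# Mock VCF (records are made up).
printf '##fileformat=VCFv4.2\n' > mock.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> mock.vcf
printf '17\t100\t.\tA\tG\t.\tPASS\t.\n' >> mock.vcf
printf '17\t200\t.\tC\tT\t.\tPASS\t.\n' >> mock.vcf
printf '17\t300\t.\tG\tA\t.\tPASS\t.\n' >> mock.vcf
printf '22\t200\t.\tT\tC\t.\tPASS\t.\n' >> mock.vcf

# First file: load the regions. Second file: keep header lines and any
# record whose 1-based POS lies inside a region (start < POS <= end).
awk -F '\t' '
    NR == FNR { chrom[NR] = $1; lo[NR] = $2; hi[NR] = $3; n = NR; next }
    /^#/      { print; next }
    {
        for (i = 1; i <= n; i++)
            if ($1 == chrom[i] && $2 > lo[i] && $2 <= hi[i]) { print; next }
    }
' regions.bed mock.vcf > region_hits.vcf
```

This checks every record against every region, so it is only meant to show the idea; for the full dataset, index-based tools such as tabix or bcftools are the more usual choice for region queries.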


Thanks. If I download the .csv file directly from gnomAD, the data is integrated from both gnomAD Genomes and Exomes, but the output of the code above contains data from Genomes only. Could I get the data integrated from both Genomes and Exomes, just as when I click to download?


I don't think there's a single file with both (officially at least) but the exome variants are at https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz (link from the gnomAD download page: https://gnomad.broadinstitute.org/downloads).
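A rough way to get both into one table is to run the same gene filter on each file and then merge the results by variant. The sketch below assumes exomes.genes.vcf and genomes.genes.vcf are the two filtered outputs (mocked here with made-up records). It only records which dataset each variant was seen in; it does not recompute combined allele counts or frequencies, so it is not equivalent to the browser's CSV export.

```shell
# Mock filtered outputs, one per dataset (records are made up).
printf '#CHROM\tPOS\tID\tREF\tALT\n17\t100\t.\tA\tG\n17\t200\t.\tC\tT\n' > exomes.genes.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\n17\t200\t.\tC\tT\n17\t300\t.\tG\tA\n' > genomes.genes.vcf

TAB=$(printf '\t')

# Emit CHROM, POS, REF, ALT plus a source label for every record, sort so
# that identical variants are adjacent, then collapse them into one row.
for src in exomes genomes; do
    grep -v '^#' "${src}.genes.vcf" |
        awk -F '\t' -v OFS='\t' -v src="$src" '{ print $1, $2, $4, $5, src }'
done |
    sort -t "$TAB" -k1,1 -k2,2n -k3,3 -k4,4 -k5,5 |
    awk -F '\t' -v OFS='\t' '
        {
            key = $1 OFS $2 OFS $3 OFS $4
            if (NR > 1 && key != prev) print prev, srcs
            srcs = (key == prev) ? srcs "," $5 : $5
            prev = key
        }
        END { if (NR > 0) print prev, srcs }
    ' > combined.tsv
```

Each row of combined.tsv holds CHROM, POS, REF, ALT and a comma-separated list of the datasets containing that variant.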

2.2 years ago
Kalin ▴ 50

I created a Python package based on SQLite databases that lets you easily query all gnomAD variants for GRCh37/38 (https://github.com/KalinNonchev/gnomAD_DB). Precomputed SQLite databases for gnomAD WGS (GRCh37/38) are linked in the description of the package. Please take a look there.
