How extract data from gnomAD for some genes?
2
0
Entering edit mode
4.6 years ago
star ▴ 350

I like to extract variant details about some genes from gnomAD 'https://gnomad.broadinstitute.org/', but I don`t want to do it by hand. I would like to know, is there any way to extract it by command line for below genes?

Table A:

RFPL3
RGMA
RGS11
RHOT1
RIPOR1
RIPOR3
RNASE9
RNF145
RNF24
RPL17
RPL17-C18orf32
RPL19
RPL7
SAMD3
SCAMP4
SCNM1

There is a 'gsutil ls gs://gnomad-public/release/2.1.1' command in gnomad but I don`t know how can I use it to extract data for my list?

gnomAD data variants • 3.1k views
ADD COMMENT
2
Entering edit mode

gsutil can be used to download data. For the moment, you'll either need to download the entire VCF and do some extraction + manual processing or download CSV files gene-by-gene.

ADD REPLY
1
Entering edit mode
2.2 years ago
Kalin ▴ 50

I created a python package based on SQLite databases, where you can easily query all gnomAD variants for GRCh37/38. https://github.com/KalinNonchev/gnomAD_DB I have precomputed SQLite databases for gnomAD WGS for GRCh37/38 in the description of the package. Please take a look there. You could query the exon intervals of the genes using it.

ADD COMMENT
0
Entering edit mode
4.6 years ago
joe ▴ 490

From R you can use the packages GenomicRanges and/or VariantAnnotation and/or vcfR and follow this workflow;

#defines the region you want to look at
rngs <- GRanges(chr, IRanges(query.start,query.end))
names(rngs) <- gene.name
param <- ScanVcfParam(which=rngs) 
this.vcf.data <- readVcf(genome.vcf, "hg19", param)

#retrieves useable information from vcf
rd <- rowRanges(vcf)
basic.info.df = as.data.frame(rd)
expanded.info.df = as.data.frame(info(vcf))
ADD COMMENT
0
Entering edit mode

This is one part of the solution. It does not address obtaining the VCF file or the genomic co-ordinates of the genes. Given that it is incomplete, I'd recommend that this content be moved to a comment rather than continue being an answer.

ADD REPLY
0
Entering edit mode

Well thank goodness for high standards...below is a bit more which addresses your concerns, although the transcript should be known along with the gene name and the vcf files should be downloaded locally via ftp or gs (or from web browser) here

This will also require RMySQL and I show here retrieving +/- 4000bp but you can change this however you like...

conn <- dbConnect(MySQL(), user = 'genome', password = '', host = 'genome-mysql.cse.ucsc.edu', dbname = 'hg19')

NM.query=c()
NM.query <- dbGetQuery(conn, paste("SELECT * FROM refGene WHERE name = '",RefSeq,"' and name2 = '",gene.name,"'",sep=""))
this.chrom = NM.query$chrom
this.strand = NM.query$strand
if(NM.query$strand == "-"){
  stop.here = NM.query$txEnd + 4000
  start.here = NM.query$txStart - 4000
}
if(NM.query$strand == "+"){
  start.here = NM.query$txStart - 4000
  stop.here = NM.query$txEnd + 4000
}

query.start = start.here
query.end = stop.here

#defines the region you want to look at
rngs <- GRanges(chr, IRanges(query.start,query.end))
names(rngs) <- gene.name
param <- ScanVcfParam(which=rngs) 
this.vcf.data <- readVcf(genome.vcf, "hg19", param)

#retrieves useable information from vcf
rd <- rowRanges(vcf)
basic.info.df = as.data.frame(rd)
expanded.info.df = as.data.frame(info(vcf))

https://gnomad.broadinstitute.org/downloads

ADD REPLY

Login before adding your answer.

Traffic: 1965 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6