Get SNPs with MAF 0.05 for all genes - attempt with BiomaRt
0
0
Entering edit mode
17 months ago
camerond ▴ 190

I have a working script which takes a list of genes and maps all the SNPs with MAF >= 0.05 to each gene using BiomaRt. I have run this for small lists of genes (~250) so far but would like a genome wide solution such that I have a lists of MAF >= 0.05 for all protein coding genes. The current script produces a list of valid rsIDs for each gene (using HGNC IDs) in a folder called {HGNC GENE}_snps and a log file entry for each gene (see below).

My current method is not ideal as:

  1. It's a bit slow.
  2. I regularly receive time outs, I think mainly due to larger genes making the query too lengthy, which means I can't retrieve all the data I need.

Is it possible to download the BioMart databases to run this locally?

Or is there a more efficient way to do this, I don't necessarily need to use BioMart or R.

Here is my script:

library(tidyverse)
library(biomaRt)
library(readxl)

## Set variables  ---------------------------------------------------------------------
IMP_GENES <- snakemake@input$genes
LOG_FILE <- snakemake@output
OUT_DIR <- snakemake@params

##  Load Data  ------------------------------------------------------------------------
genes <- read_excel(IMP_GENES, sheet = 'Human_All_Known IG_NoStatus') %>%
  filter(!grepl("placenta", Gene)) %>%
  drop_na(Gene) %>%
  arrange(Gene) %>%
  pull(Gene) 

##  Use Biormart to extract gene boundaries based on hgnc_symbol  --------------------
genemart = useMart(host="www.ensembl.org",
                   biomart = "ENSEMBL_MART_ENSEMBL",
                   dataset="hsapiens_gene_ensembl")

gene_coord <- getBM(attributes=c("hgnc_symbol", 
                                 "chromosome_name",
                                 "start_position", 
                                 "end_position",
                                 "gene_biotype"),
                    filters = c("hgnc_symbol"),
                    values = list(genes), mart = genemart)


##  Use Biormart to filter SNPs in gene boundaries on MAF > 0.05   --------------------
snpmart <- useMart(host="www.ensembl.org", 
                   biomart="ENSEMBL_MART_SNP", 
                   dataset="hsapiens_snp")

sink(paste0(LOG_FILE), append = FALSE, split = TRUE)

for (i in 1:nrow(gene_coord)) {

  tryCatch({

    gene_ID <- gene_coord$geneID[i]
    gene_chr <- gene_coord$chr[i]
    gene_start <- gene_coord$start[i]
    gene_stop <- gene_coord$stop[i]

    cat("\n----------------------------------------------------------\n")
    cat(paste0("Retrieving SNPs for: ", gene_ID, " - chr", gene_chr,":", gene_start, "-", gene_stop))
    cat("\n----------------------------------------------------------\n")

    snps <- getBM(attributes=c("refsnp_id", "allele", "chr_name", "chrom_start", "chrom_end", "minor_allele_freq"),
                  filters = c("chr_name", "start", "end"),
                  values = list(gene_chr, gene_start, gene_stop), mart = snpmart)

    snps_maf <- snps[ which(snps$minor_allele_freq >= 0.05), ]

    cat(paste0("Total loci in ", gene_ID, ": ", nrow(snps)))
    cat(paste0("\nTotal loci after MAF filter (>= 0.05): ", nrow(snps_maf)))
    cat("\n----------------------------------------------------------\n")

    write.table(snps_maf, paste0(OUT_DIR, gene_ID, "_snps"), quote = FALSE, sep = "\t",
                col.names = FALSE, row.names = FALSE )

  }, error = function(e) {

    cat("ERROR: For ", gene_ID, "\n", conditionMessage(e), "\n")

  })

}  

sink()

And an example of the log entry:

----------------------------------------------------------
Retrieving SNPs for: ZDBF2 - chr2:206274663-206314427
----------------------------------------------------------
Total loci in ZDBF2: 9422
Total loci after MAF filter (>= 0.05): 40
----------------------------------------------------------

----------------------------------------------------------
Retrieving SNPs for: ZFAT - chr8:134477788-134713049
----------------------------------------------------------
ERROR: For  ZFAT 
 Timeout was reached: [www.ensembl.org:80] Operation timed out after 300000 milliseconds with 1062424 out of -1 bytes received

Many Thanks.

BiomaRt SNPs MAF genes • 331 views
ADD COMMENT

Login before adding your answer.

Traffic: 2987 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6