Question

Matching Input Df To Multi-Result Output?

0

Entering edit mode

11.3 years ago

hanlonmt • 0

I'm working on some between-species post-GWAS comparisons and have reached a bottleneck in my processing. Any input would be great.

A truncated version of my output looks like this:

SUB_POP        SNP.NAME    CHROMOSOME    POSITION    pVALUE    region
A             SNP-5.3784657.    5    16288402    7.20E-07  5:3734657:3834657
A             SNP-11.26079639.    11    26551225    1.17E-06  11:266501225:266601225
A             SNP-3.30115818.    3    30122780    1.66E-06  3:30072780:30172780

dput

structure(list(SUB_POP = structure(c(1L, 1L, 1L), .Label = "A", class = "factor"), 
SNP.NAME = structure(c(3L, 1L, 2L), .Label = c("SNP-11.26079639.", 
"SNP-3.30115818.", "SNP-5.3784657."), class = "factor"), 
CHROMOSOME = c(5L, 11L, 3L), POSITION = c(16300000, 26551225, 
30122780), pVALUE = c(7.2e-07, 1.17e-06, 1.66e-06), region = structure(c(3L, 
1L, 2L), .Label = c("11:266501225:266601225", "3:30072780:30172780", 
"5:3734657:3834657"), class = "factor")), .Names = c("SUB_POP", 
"SNP.NAME", "CHROMOSOME", "POSITION", "pVALUE", "region"), class = "data.frame", row.names = c(NA, 
-3L))

The 'range' column defines a region on the chromosmoe that contains genes in LD with the SNP, thus making all viable candidates. This portion of the project was completed in rice, which has quite large LD blocks.

I can query biomaRt and get these genes, but then have to match the output back to this original file for further processing. I have written this into a ddply loop but that is both painfully slow and not how (I think) biomart resources should really be used. Any idea of how I can maintain this input code and get some sort of intelligible output that I can use for processing down the line. BiomaRt and such already loaded.

rice_mart <- useMart("plants_mart_21", dataset="osativa_eg_gene")
library(plyr)
gr<-function(x){getBM(attributes=c("ensembl_gene_id","chromosome_name","start_position","end_position","strand","description"), filters=c("chromosomal_region","with_embl","biotype"),values=list(x$region,TRUE,"protein_coding"),mart=rice_mart)}
genes_rice <- ddply (sample,.(SUB_POP,SNP.NAME,POSITION,pVALUE), .progress="text", gr)

This gives me exactly what I want, which is the input df and the biomart results merged together. If I were querying by gene ID or something else, I could easily use a join function, but the region complicates this. Does anyone have experience or ideas on how to approach this?

biomart r gwas • 2.2k views

ADD COMMENT • link 11.3 years ago by hanlonmt • 0

0

Entering edit mode

What are you trying to do overall? ie, what are you starting with and what would you like to find out about it?

ADD REPLY • link 11.3 years ago by Emily 24k