I'm working on some between-species post-GWAS comparisons and have reached a bottleneck in my processing. Any input would be great.
A truncated version of my output looks like this:
SUB_POP SNP.NAME CHROMOSOME POSITION pVALUE region
A SNP-5.3784657. 5 16288402 7.20E-07 5:3734657:3834657
A SNP-11.26079639. 11 26551225 1.17E-06 11:266501225:266601225
A SNP-3.30115818. 3 30122780 1.66E-06 3:30072780:30172780
dput
structure(list(SUB_POP = structure(c(1L, 1L, 1L), .Label = "A", class = "factor"),
SNP.NAME = structure(c(3L, 1L, 2L), .Label = c("SNP-11.26079639.",
"SNP-3.30115818.", "SNP-5.3784657."), class = "factor"),
CHROMOSOME = c(5L, 11L, 3L), POSITION = c(16300000, 26551225,
30122780), pVALUE = c(7.2e-07, 1.17e-06, 1.66e-06), region = structure(c(3L,
1L, 2L), .Label = c("11:266501225:266601225", "3:30072780:30172780",
"5:3734657:3834657"), class = "factor")), .Names = c("SUB_POP",
"SNP.NAME", "CHROMOSOME", "POSITION", "pVALUE", "region"), class = "data.frame", row.names = c(NA,
-3L))
The 'range' column defines a region on the chromosmoe that contains genes in LD with the SNP, thus making all viable candidates. This portion of the project was completed in rice, which has quite large LD blocks.
I can query biomaRt and get these genes, but then have to match the output back to this original file for further processing. I have written this into a ddply loop but that is both painfully slow and not how (I think) biomart resources should really be used. Any idea of how I can maintain this input code and get some sort of intelligible output that I can use for processing down the line. BiomaRt and such already loaded.
rice_mart <- useMart("plants_mart_21", dataset="osativa_eg_gene")
library(plyr)
gr<-function(x){getBM(attributes=c("ensembl_gene_id","chromosome_name","start_position","end_position","strand","description"), filters=c("chromosomal_region","with_embl","biotype"),values=list(x$region,TRUE,"protein_coding"),mart=rice_mart)}
genes_rice <- ddply (sample,.(SUB_POP,SNP.NAME,POSITION,pVALUE), .progress="text", gr)
This gives me exactly what I want, which is the input df and the biomart results merged together. If I were querying by gene ID or something else, I could easily use a join function, but the region complicates this. Does anyone have experience or ideas on how to approach this?
What are you trying to do overall? ie, what are you starting with and what would you like to find out about it?