Question

Updated R Package To Analyse Eqtl, And Tutorials Available For The Association Of Genetic Variants And Gene Expression

3

Entering edit mode

13.6 years ago

Fra ▴ 30

Hi,

I am looking for the best software used today to do eQTL analysis (the association of SNPs genotyped using a chip with genome-wide expression levels), and an updated step-by-step tutorial that could guide me. I am having trouble keeping up with all the different r packages that keep getting updated. I would like to use snpMatrix which is included now in snpStats, or GGtools which seems like a very nice package, but I am finding it very confusing to understand since it is tied to other packages (like snpmatrix) for the input and initial steps. There are great tutorials for GGtools describing what to do once the data is already inputed in this format, but I can't find a good tutorial which describes the steps from the start.

I am not sure whether there is any detailed tutorial which outlines for example:

Input file description
association of Snps and expression (i.e. expression as outcome)
Find cis eQTL (within for example 1 Mb of start/stop site of all the genes) and trans eQTL, and 4. p-value adjustements using permutation and Bonferroni.

If such a tutorial does not exist, could you please point me to the most updated tutorial, if there is one, which apply to the current versions of these softwares?

Any help is very appreciated. Thank you very much.

r eqtl • 9.3k views

ADD COMMENT • link updated 6.5 years ago by Biostar 20 • written 13.6 years ago by Fra ▴ 30

score 4 · Answer 1 · 2011-11-28

Briefly, a tutorial was conducted at ISMB 2011. The resources are posted at

http://ismb11gg.wordpress.com/

Your post mentions "snpmatrix" but the bulk of infrastructure for genotype representation (including representation of uncertain imputed genotypes) and statistical testing is from the snpStats package of D. Clayton. The snpStats package includes routines for importing genotype data in various formats including ped and mach files.

The significance and value of innovative representations of large-scale genotyping results will change over time as computing and genetic methods evolve. Here is a brief sketch of how to create a SnpMatrix instance. Assume you have snpStats package installed. Basic data encoding genotypes is in matrix x, assumed to represent two individuals genotyped at 4 loci. R values of 0 correspond to missing data, and values of 1, 2, 3 correspond to A/A, A/B, B/B respectively. You can use row and column names to clarify identities of samples and loci, I skip this.

> x = matrix(c(0,1,2,3,1,1,2,0), nr=2)
> x
     [,1] [,2] [,3] [,4]
[1,]    0    2    1    2
[2,]    1    3    1    0

Transform this information to raw bytes. This helps keep memory images small.

> rx = matrix(as.raw(x),nr=2)
> rx
     [,1] [,2] [,3] [,4]
[1,]   00   02   01   02
[2,]   01   03   01   00

Now create the S4 instance.

> library(snpStats)
> sm = new("SnpMatrix", rx)
object has no names - using numeric order for row/column names
> sm
A SnpMatrix with  2 rows and  4 columns
Row names:  1 ... 2 
Col names:  1 ... 4 
> as(sm, "character")
     [,1]  [,2]  [,3]  [,4] 
[1,] "NA"  "A/B" "A/A" "A/B"
[2,] "A/A" "B/B" "A/A" "NA"

The tutorial document gives some information on how contents of the byte additional to 0, 1, 2, 3 are used to represent uncertain genotypes.

The SnpMatrix instance can now be used for large scale testing using facilities in snpStats package. GGtools makes heavy use of snp.rhs.tests.

Once you have a SnpMatrix representing genotyping results (or a collection of SnpMatrix instances, perhaps one per chromosome, if you have many millions of loci to handle), you can bind it to expression data in an smlSet instance. GGtools documentation and the tutorial noted above take you from there. I would be happy to hear about limitations or alternatives to these approaches.

VJ Carey, GGtools developer, accessible through the bioconductor mailing list.

score 1 · Answer 2 · 2011-11-28

This is getting to be very detailed with respect to R/Bioconductor, so you might want to pose further queries to that mailing list. I have no way of reproducing the details of your work, but I can show a successful workflow using the test data supplied with plink 1.07. First, convert the test.ped file supplied with plink. This yields plink.bed, bim, fam

bash-3.2$ ./plink --file test --make-bed

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web...  OK, v1.07 is current

Writing this text to log file [ plink.log ]
Analysis started: Mon Nov 28 14:28:35 2011

Options in effect:
        --file test
        --make-bed

2 (of 2) markers to be included from [ test.map ]
6 individuals read from [ test.ped ] 
6 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
3 cases, 3 controls and 0 missing
6 males, 0 females, and 0 of unspecified sex
Before frequency and genotyping pruning, there are 2 SNPs
6 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 1
0 SNPs failed missingness test ( GENO > 1 )
0 SNPs failed frequency test ( MAF < 0 )
After frequency and genotyping pruning, there are 2 SNPs
After filtering, 3 cases, 3 controls and 0 missing
After filtering, 6 males, 0 females, and 0 of unspecified sex
Writing pedigree information to [ plink.fam ] 
Writing map (extended format) information to [ plink.bim ] 
Writing genotype bitfile to [ plink.bed ] 
Using (default) SNP-major mode

Analysis finished: Mon Nov 28 14:28:35 2011

Now we work with R 2.14.

> xx = read.plink("plink.bed")
> xx
$genotypes
A SnpMatrix with  6 rows and  2 columns
Row names:  1 ... 6 
Col names:  snp1 ... snp2

$fam
  pedigree member father mother sex affected
1        1      1     NA     NA   1        1
2        2      1     NA     NA   1        1
3        3      1     NA     NA   1        1
4        4      1     NA     NA   1        2
5        5      1     NA     NA   1        2
6        6      1     NA     NA   1        2

$map
     chromosome snp.name cM position allele.1 allele.2
snp1          1     snp1 NA        1        A        C
snp2          1     snp2 NA        2        G        T

> es = new("ExpressionSet", exprs=matrix(0, nr=10, nc=6))
> ss = make_smlSet(es, list("1"=xx$genotypes))
> ss
SnpMatrix-based genotype set:
number of samples:  6 
number of chromosomes present:  1 
annotation:  
Expression data dims: 10 x 6 
Phenodata: An object of class "AnnotatedDataFrame": none

I don't know why your construction attempt failed. This shows there is nothing special about working from bed bim files to import genotype data for use with GGtools.

> as(smList(ss)[[1]], "character")
     [,1]  [,2] 
[1,] "A/A" "A/B"
[2,] "A/B" "A/B"
[3,] "B/B" "A/A"
[4,] "A/B" "B/B"
[5,] "B/B" "A/B"
[6,] "B/B" "B/B"

> sessionInfo()
R version 2.14.0 Under development (unstable) (2011-06-06 r56067)
Platform: x86_64-apple-darwin10.7.0/x86_64 (64-bit)

locale:
[1] en_US.US-ASCII/en_US.US-ASCII/en_US.US-ASCII/C/en_US.US-ASCII/en_US.US-ASCII

attached base packages:
[1] splines   stats     graphics  grDevices datasets  utils     tools    
[8] methods   base

other attached packages:
 [1] GGtools_3.99.33       GenomicRanges_1.5.49  org.Hs.eg.db_2.5.0   
 [4] rtracklayer_1.13.17   RCurl_1.6-10          bitops_1.0-4.1       
 [7] IRanges_1.11.27       annotate_1.31.1       AnnotationDbi_1.15.23
[10] GGBase_3.13.5         RSQLite_0.9-4         DBI_0.2-5            
[13] snpStats_1.3.4        Matrix_0.9996875-3    lattice_0.19-33      
[16] survival_2.36-9       Biobase_2.13.9        BiocInstaller_1.2.1  
[19] weaver_1.19.0         codetools_0.2-8       digest_0.5.1

loaded via a namespace (and not attached):
[1] Biostrings_2.21.10 bit_1.1-7          BSgenome_1.21.5    ff_2.2-3          
[5] grid_2.14.0        Rsamtools_1.5.67   XML_3.4-3          xtable_1.5-6      
[9] zlibbioc_0.1.7

score 0 · Answer 3 · 2011-11-28

Thank you very much for your explanations! I am still having trouble though trying to create a smlSet from the plink files and can't find a good explanation about this anywhere. My ExpressionSet is called 'es', and I imported my genotype using the command:

snp.matrix<-read.plink("plink.bed", "plink.bim", "plink.fam")

Which is a list with 3 elements including $genotype:

str(snp.matrix$genotype)
Formal class 'SnpMatrix' [package "snpStats"] with 1 slots
  ..@ .Data: raw [1:2, 1:4] 03 03 03 03
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:2] "1" "2"
  .. .. ..$ : chr [1:4] "1" "2" "3" "4"

But I must be missing a step because when I try to create my smlSet using the expression 'es' and the snp$genotype I get an error message:

msl=make_smlSet(es, list(`1`=snp.matrix$genotypes))
Error in make_smlSet(es, (snp.matrix$genotypes)) : 
sml must be list of SnpMatrix instances from 'snpStats' package

It seems as the snp.matrix must be in the form of a list, so maybe I have to subset the snp.matrix to only one chromosome, or the list of chromosomes?
Thank you again for the help, really appreciate it!