Question

Are there any statistical method is suitable to solve

0

Entering edit mode

2.8 years ago

octpus616 ▴ 100

Hi,

I recently pay attention to study the effect of specific genomic variation on gene expression. However I have some confusion about statistical analysis.

Firstly, please allow me use some simulated data to describe the situation I encountered. I have apply RNA-seq on different samples and get a expression matrix like this:

set.seed(1234)
exp_matrix <- replicate(10, rnorm(4))
rownames(exp_matrix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(exp_matrix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
exp_matrix

         sample_a   sample_b   sample_c    sample_d   sample_e   sample_f   sample_g
gene_a -1.2070657  0.4291247 -0.5644520 -0.77625389 -0.5110095  0.1340882 -0.6937202
gene_b  0.2774292  0.5060559 -0.8900378  0.06445882 -0.9111954 -0.4906859 -1.4482049
gene_c  1.0844412 -0.5747400 -0.4771927  0.95949406 -0.8371717 -0.4405479  0.5747557
gene_d -2.3456977 -0.5466319 -0.9983864 -0.11028549  2.4158352  0.4595894 -1.0236557
         sample_h   sample_i   sample_j
gene_a -0.0151383 -0.7094400 -2.1800396
gene_b -0.9359486 -0.5012581 -1.3409932
gene_c  1.1022975 -1.6290935 -0.2942939
gene_d -0.4755931 -1.1676193 -0.4658975

At the same time, I have a matrix that describes whether there is a interesting variation model on a specific gene region in a specific sample（if hit, noted as 1), Its look like this:

phe_martix <- replicate(10, ifelse(sample(16, 4) > 2, 0, 1))
rownames(phe_martix) <- paste0("gene_", letters[1:dim(exp_max)[1]])
colnames(phe_martix) <- paste0("sample_", letters[1:dim(exp_max)[2]])
phe_martix
           sample_a sample_b sample_c sample_d sample_e sample_f sample_g sample_h sample_i
gene_a        1        0        0        0        0        1        0        0        0
gene_b        0        0        0        1        0        0        0        0        0
gene_c        0        1        0        0        1        0        0        1        0
gene_d        0        0        0        0        0        0        0        0        0
       sample_j
gene_a        0
gene_b        0
gene_c        1
gene_d        0

Please note that the our interesting variation model is common across sample，But for specific genes, it is rare, so I am unlikely to get duplicates at the gene level.

I have three questions here:

Is there any way to statistical method to inference that does this variation lead to up-regulation or down-regulation of gene expression?
For specific hit genes, can I infer whether it has significant difference to background (mean) expression? For example, the gene_b of sample_d is a hit gene, does its expression levels higher or lower than background expression (noted as 0)?
Are the one sample z-test or t-test is suitable to solve Q2?

This is may a layman-like question. But I really confused for it. So I hope hear your kindly advice.

Thanks for you reading.

Best.
Zhang

z-test • 994 views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 2.8 years ago by octpus616 ▴ 100

score 3 · Answer 1 · 2021-06-15

I think you could be looking for expression quantitative trait loci (eQTL).

Here is a short overview article.

But I am not sure if that concept is applicable to your condensed data. You would need the original genotyping information to calculate an association test between variants and gene expression phenotype, not just whether there is any "interesting" (by what means?) variant in a gene (e.g. a SNP). Instead, you have to let the association test decide about variants that may be linked to gene expression. So the answer to Q1 is "yes, association testing".

There is an R-package called Matrix eQTL you could try. If you can convert your data into its input format, that should be fine, but I don't know if you have enough samples to achieve good power to detect anything.

For Q2 and Q3 you needed to define "background expression" first. In principle you could use your binary variable "hit/no hit" to divide samples into two groups (per gene) and do a test for differential expression. What is best for that purpose depends on how the expression values were obtained. A wilcoxon.test would always work but may be severely underpowered with your example data. On the other hand, standard DGE methods like DESeq2 use sample groupings that are the same for each gene and are therefore not applicable.

Edit: here is some R-code, thanks to the reproducible example you gave, that calculates t.test and wilcox.test for each gene, based on the grouping by phenotype.

set.seed(1234)
exp_matrix <- replicate(10, rnorm(4))
rownames(exp_matrix) <- paste0("gene_", letters[1:dim(exp_matrix)[1]])
colnames(exp_matrix) <- paste0("sample_", letters[1:dim(exp_matrix)[2]])

phe_martix <- replicate(10, ifelse(sample(16, 4) > 2, 0, 1))
rownames(phe_martix) <- paste0("gene_", letters[1:dim(exp_matrix)[1]])
colnames(phe_martix) <- paste0("sample_", letters[1:dim(exp_matrix)[2]])

# could also filter the matrix to only include rows with > 2 integration events
mat <- cbind(exp_matrix, phe_martix)
# run the combined matrix through the tests, use try because of lacking samples
ret <- apply(mat, 1, function(x){
    exp <- x[1:ncol(exp_matrix)] 
    phe <- as.factor(x[(ncol(exp_matrix)+1):(2*ncol(exp_matrix))])
    list (
           t.test=try(t.test(exp ~ phe)), 
           w.test=try(wilcox.test(exp ~ phe)))
})
## returns a list of the test results for each gene