How to deal with large inter-sample variation when using GTEx RNA-seq data?
1
0
Entering edit mode
2.6 years ago
Akos ▴ 20

Dear Everyone,

I am quite new to RNA-seq data analysis methods (and statistics) and I hereby I would like to ask for some help, suggestions and personal experience related to dealing with GTEx dataset. Also, my apologies in advance if this is a redundant question, however, when I was reading some forums I did not find satisfactory explanations and suggestions.

For my work, I was planning to use the publicly available GTEx data from human tissues. I would like to compare RNA-seq data from certain tissues from healthy and diseased samples. My aim would be to find Differentially expressed genes and to make additional comparisons later on ... so for this:

1. I downloaded the latest GTEx Raw readcount and TPM matrices.
2. From the RAW readcount matrix, I extracted those columns that matches my criteria. (selected tissues, sex, disease, etc.)
3. As a pilot run , I selected 5 - 5 samples from healthy (HE) and diseased (DIS) (preferably with the highest RNA quality (RIN) values) and run a DEseq2 analyses.

As expected the PCA plot showed quite a large inter-sample variation, but surprisingly the HE and DIS samples were also not separated well. Nevertheless, I got only a few significant DEG, that was not matching with previously reported gene expression profiles.

We thought that we could find more DEG by pre-selecting more "similar" datasets for each conditions. To find more "similar" samples for each condition, I generated a PCA plot from all HE samples and DIS samples separately.

PCA plot

Based on these PCA plots, 10 - 10 samples were selected that were showing smaller distances from each other in both PC. After running DEseq2 on 10 vs 10 samples, the PCA plot showed again large inter-sample variations also within each conditions and returning zero significant DEG.

I also tried the same samples with limma-voom, but ended up with zero DEG again.

After this experience, I would like to ask the following questions:

1. Can we use large public datasets like GTEx for DEG analyses?
2. (If yes) How many samples per conditions are optimal to find DEGs and keeping the inter-sample variation less disturbing?
3. is there any good method , pipeline, tool, etc... to find DEGs when we have many samples per conditions and batch effects?
4. Is it a good idea to pre-select samples that are "more identical" (grouping together) based on a two dimensional PCA plot?
5. Anyone else has similar experience with GTEx?
6. bonus question : is it possible that some samples are mixed up on the GTEx database?

Looking forward to your responses. And thank you in advance for any suggestions and comments.

RNA-Seq GTEx variation deseq2 limma • 1.4k views
0
Entering edit mode

Are you only comparing samples within the GTEx dataset or are you also adding other datasets to the analysis?

1
Entering edit mode

Yes, I do compare only GTEx samples from the same tissues. I am not planning to add other datasets.

3
Entering edit mode
2.6 years ago

That PCA bi-plot looks unusual, however, it is expression from various related tissues, so, it makes sense, in fact. Your groups of interest do not necessarily have to visually segregate by PCA for there to exist statistically significant differences between them.

Where is your evidence of there existing a batch effect?

You should generally not 'pick and choose' (pre-select) samples too much. Just take a large bunch of samples for your study, normalise them together, and then perform the differential expression comparisons between the groups of your choice. By limiting your initial sample selection to just a few samples, you are introducing bias. This said, it is impossible to avoid all sources of bias in everything that we do as human beings.

In general, you should aim for at least 3 samples per group, but better to have 5.

I have some experience with GTEx, but more FANTOM5, TCGA, and CCLE. I have never had problems performing differential expression comparisons in these cohorts, provided that my starting point was raw counts and that I used a reputable differential expression analysis tool, like DESeq2 or EdgeR.

0
Entering edit mode

Thank you for the response. I am glad to see that I was more or less on a good track.

That PCA bi-plot looks unusual, however, it is expression from various related tissues, so, it makes sense, in fact.

Indeed the PCA plot is unusual. I used the following R script on the TPM matrices (the linked image) and also on the raw counts:

MATRIX = read.table("rawcounttable.txt", header = T)
pca= prcomp( MATRIX , center=T, scale=T)
plot(pca$rotation[,1],pca$rotation[,2], xlab = "PC1", ylab = "PC2")
text(pca$rotation[,1],pca$rotation[,2], row.names(pca$rotation), cex=0.5, pos=4)  Where is your evidence of there existing a batch effect? Well, I do not have an evidence per se for a batch effect. I was assuming that the strange PCA plot might be due to some differences during sample collection and/or the sequencing itself (different kit, different flowcells.. etc). All these "hands-on" steps should intrinsically contribute to some level of variation. You should generally not 'pick and choose' (pre-select) samples too much.... In general, you should aim for at least 3 samples per group, but better to have 5. Normally I would not preselect samples. And I won't do it anymore. However, I got a bit surprised when I increased the number of samples / condition (from 5 to 10) , I got less significant DEG. DESeq2 has a normalization step, right? Or should I introduce additional level of normalization? ADD REPLY 0 Entering edit mode What if you plot the x variable from the prcomp output? plot(pca$x[,1],pca$x[,2], xlab = "PC1", ylab = "PC2") text(pca$x[,1],pca$x[,2], row.names(pca$x), cex=0.5, pos=4)


DESeq2 will indeed do normalisation for you. Your PCA bi-plot should then be generated from the transfformed normalised counts. Have you looked through the DESeq2 Vignette?

0
Entering edit mode

Thank you for the script. I've tried it. I am not sure what happened, but by using the script, I have way more dots on the plot than actual samples:

PCA-plot-again

2
Entering edit mode

The prcomp variable names are somewhat misleading: When you use the rotated element, you are just plotting the component loadings; the x element, then, refers to the original data multiplied by the loadings (matrix multiplication). In one case, you will be plotting genes; in the other, samples. Usually, we want to plot the x element. I think that it is up to you to ensure that you know what you are doing.

In which orientation is your input data, MATRIX? If MATRIX has genes as rows and samples as columns, then your x element from prcomp will be plotting genes. If MATRIX has genes as columns and samples as rows, then your x element from prcomp will be plotting samples.

0
Entering edit mode

This reflects my limited knowledge about R (I'm not a bioinformatician). Sorry about that.

So the data matrix is as you wrote: Genes as rows, samples as columns. First row is name of samples.

1
Entering edit mode

I see - thank you. In that case, to do the PCA in the usual way, you should do:

pca= prcomp( t(MATRIX) , center=T, scale=T)
plot(pca$x[,1],pca$x[,2], xlab = "PC1", ylab = "PC2")
text(pca$x[,1],pca$x[,2], row.names(pca\$x), cex=0.5, pos=4)


I also developed a R package for PCA, but it is not yet released (but already heavily tested): https://github.com/kevinblighe/PCAtools

1
Entering edit mode

Alternatively you can just use DESeq2's plotPCA() function

1
Entering edit mode

Thank you for the responses. I will test these methods you suggested.