How to normalize and transform RNA-seq data of different samples for PCA?
1
2
Entering edit mode
3 months ago
Pratik Mehta ▴ 590

Hi Biostars Community,

I have searched the forum, but couldn't find a perfect answer for this.

I have downloaded some RNA-seq data from GEO (GSE112656). They are basically 11 osteoarthritis samples and 11 rheumatoid arthritis (so 22 in total).

I also have 4 samples of RNA-seq data generated from my lab with three technical replicates (so 12 total).

1. Drug A - treated
2. Drug A - control (untreated)
3. Drug B - treated
4. Drug B - control (untreated)

I want to use the GEO (GSE112656) and also my lab data to conduct PCA to see which ones are similar and cluster together and which ones are different.

How would I normalize and transform these datasets before following these directions:

https://bioconductor.org/packages/devel/bioc/vignettes/PCAtools/inst/doc/PCAtools.html

I was considering following this post: Which counts to use for RNA-seq heatmap and PCA?

Basically, I am planning to combine all the samples (22 +12 = 32 samples) into one large data frame and then generate log2 transformed TMM followed by throwing that table into a PCA function to visualize the PC_1 by PC_2 table, or maybe PC_1 by PC_2 by PC_3 to seeing how they cluster/how similar or different they are to one another. I am planning TMM because HBC recommends this to compare between samples and within samples. Would all this be best practices?

Please, if you can suggest something better, I would appreciate it.

TMM PCA RNA-seq edgeR • 488 views
4
Entering edit mode
3 months ago

Hey mate,

I would process both datasets independently to produce variance-stabilised expression levels via DESeq2, bind these together, and then eliminate the [assumed] batch effect via limma::removeBatchEffect(). It is the resulting batch-corrected, column-bound, variance-stabilised expression matrix on which I would conduct PCA, with scale = FALSE (but center = TRUE).

No issue going the EdgeR / limma-voom routes, either, in which case use log2 CPM+1 in place of variance-stabilised expression levels.

Kevin

# @@

Thanks for up-votes and accepting the answer. It would also be interesting to process all samples together from the beginning, and then still remove batch on the variance-stabilised expression matrix that is produced - see https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#why-after-vst-are-there-still-batches-in-the-pca-plot

0
Entering edit mode

Thank you Kevin

You always swoop in to save the day : )

I will try this. I just tried the whole procedure using: design = ~1. Result were ehh, but that could just be what they are lol... Now I think i will try to get "fancy" with the design and use actual ones for each of the datasets. They aren't too fancy... one dataset is either OA or RA so design = ~ cell, and the drug A and drug B datasets are similar to design = ~dex from the DESeq2 manual.

And then finally I will try with edgeR just log2 CPM+1 each dataset individually, cbind (column-bind), and then slap that table through PCAtools.

Thank you again. I strive to be as compassionate as you, Kevin, (you've helped so many people on here and other places, and you continue to) : )

I tried to be explicit in everything here so this post would serve as a good resource - please don't think I think you don't know cbind, dex, etc...

1
Entering edit mode

If one dataset is RA while the other is OA (I have worked and published in both areas), then it is difficult (even impossible) to remove any effect of batch due to the fact that batch is confounded with disease (OA or RA). If this is the case, I would process each data independently to the VST stage, and then transform these datasets independently to Z-scores. Then, at least, the datasets are comparable.