Question

Should I cluster and PCA the whole sample or only the genes of interest?

3.2 years ago

Omar Mohamed • 0

I want to analyze TCGA data for all cancer and normal tissue types, scaled and normalized together. After that, I tried PCA to visualize the different variables, check for batch effects, and see how the tissues and diagnoses cluster. But there are so many genes and studies that it takes too long to examine what the PCA looks like for even a single study variable, and I have about 20 potential covariates and batch factors!

Should I take the PCA seriously, as recommended by most workflows? Can I keep only the genes of interest and ignore all of the variability in the other genes?

I am asking about subsetting genes because I read in another blog that it is recommended to keep all of the genes for normalization. I am using empirical Bayes normalization, following the limma-voom workflow.

RNA-Seq Normalisation • 1.3k views

score 2 · Answer 1 · 2021-01-29

3.2 years ago

swbarnes2 14k

The DESeq recommended workflow is to work out the variance for each gene, and do PCA on the 500 or 1000 or so genes with the most variance. That seems sound here. The genes with less variance won't be contributing much to differences between samples.
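As a rough illustration of that idea (not the DESeq code itself), here is a minimal Python/NumPy sketch. It assumes you already have a genes x samples matrix of log-transformed or variance-stabilized values; the names `top_variable_genes`, `pca_samples` and the simulated `log_expr` matrix are made up for this example.

```python
import numpy as np


def top_variable_genes(log_expr, n_top=500):
    """Return row indices of the n_top genes with the highest variance.

    log_expr: genes x samples array of transformed expression values.
    """
    variances = log_expr.var(axis=1, ddof=1)
    n_top = min(n_top, log_expr.shape[0])
    # argsort is ascending, so reverse it to get the most variable genes first
    return np.argsort(variances)[::-1][:n_top]


def pca_samples(log_expr, n_components=2):
    """PCA of the samples via SVD; returns (scores, explained variance ratio)."""
    x = log_expr.T                      # samples x genes
    x = x - x.mean(axis=0)              # center each gene
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]


# Simulated data (hypothetical): 2000 genes, 10 samples,
# where only the first 50 genes vary strongly between samples.
rng = np.random.default_rng(0)
n_genes, n_samples = 2000, 10
log_expr = rng.normal(loc=5.0, scale=0.1, size=(n_genes, n_samples))
log_expr[:50] += rng.normal(scale=2.0, size=(50, n_samples))

keep = top_variable_genes(log_expr, n_top=50)
scores, explained = pca_samples(log_expr[keep], n_components=2)
print(scores.shape, explained)
```

With real data you would pass something like `n_top=500` or `n_top=1000` and then colour the rows of `scores` by each study variable.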

Thank you so much, that was insightful!

Reply by Omar Mohamed • 3.2 years ago

score 1 · Answer 2 · 2021-01-29

3.2 years ago

Papyrus ★ 2.9k

If you want to know about batch effects, which are often methodological, or about any covariate in general, I think you should investigate all of the genes. Those types of bias will be present across all of your genes (i.e. a priori, your genes of interest should not be more or less affected by batches, etc. than any other gene), so using all of the genes gives you more information. In fact, there are algorithms that rely precisely on looking at the "rest of the genes" outside the comparison to model the variance from unknown covariates.

If you have many variables and it becomes hard to look at all the PCA plots, you can screen them by computing correlations between the variables and the principal components. Plotting these correlations (in a heatmap, for example) gives you a quick look at how your experimental variables are associated with the PCs. As an example of what I mean, see the pcrplot function in the ENmix package (section 11 of the user's guide).
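To make the screening idea concrete, here is a minimal Python/NumPy sketch. It is not the ENmix implementation, just plain Pearson correlations between each covariate and each PC. It assumes the covariates are numeric or coded as 0/1 (a categorical variable with more than two levels would need a different statistic, such as the R² from a one-way ANOVA). The function names and the simulated `batch` and `age` variables are made up for this example.

```python
import numpy as np


def pca_scores(expr, n_components=3):
    """PC scores for a samples x genes matrix, computed via SVD."""
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def covariate_pc_correlations(scores, covariates):
    """Pearson correlation of every covariate with every PC.

    covariates: dict mapping a name to a 1-D numeric array (one value per sample).
    Returns a dict mapping each name to an array with one correlation per PC.
    """
    result = {}
    for name, values in covariates.items():
        values = np.asarray(values, dtype=float)
        result[name] = np.array(
            [np.corrcoef(values, scores[:, k])[0, 1] for k in range(scores.shape[1])]
        )
    return result


# Simulated data (hypothetical): 12 samples, 200 genes,
# with every gene shifted upwards in the second batch.
rng = np.random.default_rng(1)
n_samples, n_genes = 12, 200
batch = np.array([0] * 6 + [1] * 6)
age = rng.uniform(30, 80, size=n_samples)   # covariate unrelated to expression
expr = rng.normal(size=(n_samples, n_genes))
expr[batch == 1] += 3.0

scores = pca_scores(expr, n_components=3)
correlations = covariate_pc_correlations(scores, {"batch": batch, "age": age})
for name, r in correlations.items():
    print(name, np.round(r, 2))
```

The result is a small covariates x PCs table, so with around 20 variables it can be drawn as a single heatmap instead of inspecting 20 separate PCA plots.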

Thanks a lot! But what do you mean by algorithms that look at the rest of the genes? That part is not clear to me.

Reply by Omar Mohamed • 3.2 years ago

Sorry, I only mentioned that to explain why "information" about covariates such as batches may sit in genes outside your subset of genes of interest (for an example, see the RUVSeq package). These are methods used to correct for covariates when performing the differential testing, so they are not directly related to your question. I did not mean to be confusing.

My answer was meant to convey that you should not use your particular set of "genes of interest" (related to your comparison/treatment/disease) to investigate the possible influence of variables in your data. That set may be too small and non-representative, and if you want to do a blind analysis, selecting a set of genes of interest may bias which variables appear to be associated with them. Any gene, not just those of interest, carries information.

Nonetheless, swbarnes2 is absolutely right in their answer that the top 500-1000 most variable genes are enough to do the PCA, and that will indeed be faster than using all of the genes.

Reply by Papyrus • 3.2 years ago