Question

Cleaning the expression matrix for the purpose of deconvolution of RNA-Seq data

6.5 years ago

jane.merlevede ▴ 20

Dear all,

I wonder about the cleaning of rna-seq counts in the context of tumor deconvolution based on rna-seq data.

When performing a classic RNA-seq differential expression analysis (DEA), it is common to remove the genes that are not expressed, or barely expressed, across the samples, which removes roughly 5-30% of the genes, I would say. This filtering step is the only one I am aware of that is commonly performed for DEA. I am interested in the question of tumor deconvolution. In this context, one starts with an expression matrix. I am wondering whether this matrix should be preprocessed (more extensively than just removing lowly expressed genes) to remove non-informative genes and potential noise.
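For concreteness, the low-expression filter described above might look like the following minimal sketch. It is only an illustration: the counts-per-million (CPM) cutoff of 1, the requirement of at least 2 samples, the function names, and the toy counts are all assumptions made for the example, not values recommended in this thread.

```python
def counts_per_million(counts):
    """Convert raw counts to CPM.

    `counts` maps gene name -> list of raw counts, one entry per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample (column sums).
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [1e6 * c / size for c, size in zip(row, library_sizes)]
        for gene, row in counts.items()
    }


def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples.

    Both thresholds are illustrative assumptions.
    """
    cpm = counts_per_million(counts)
    return {
        gene: row
        for gene, row in counts.items()
        if sum(value >= min_cpm for value in cpm[gene]) >= min_samples
    }


# Toy data: 4 genes x 3 samples (made-up numbers).
toy_counts = {
    "GENE_A": [500000, 450000, 520000],
    "GENE_B": [499990, 550000, 480000],
    "GENE_C": [10, 0, 0],  # expressed in one sample only
    "GENE_D": [0, 0, 0],   # never expressed
}
kept = filter_low_expression(toy_counts, min_cpm=1.0, min_samples=2)
print(sorted(kept))
```

In this toy matrix, GENE_C passes the CPM cutoff in only one sample and GENE_D in none, so both are dropped.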

I was recently introduced to the analysis of methylation data, specifically what to do once you have the percentage of methylation per CpG. Some people remove CpGs that have little variance across samples (mainly unmethylated CpGs, but not only) and CpGs on the X and Y chromosomes. They also check whether the methylation of certain CpGs correlates with clinical variables such as age or gender (when available), in order to filter those CpGs or adjust for the variables.

I wonder why there are many criteria applied to methylation data that are not (to my knowledge) used on RNA-seq data. Do you know why, and do you think they should be used for tumor deconvolution based on RNA-seq data?

Thank you in advance for your comments. Jane

RNA-Seq deconvolution preprocessing • 2.7k views

ADD COMMENT • link updated 2.3 years ago by Kevin Blighe 89k • written 6.5 years ago by jane.merlevede ▴ 20

score 3 · Answer 1 · 2019-01-18

6.5 years ago

Kevin Blighe 89k

Hi Jane,

With regard to the question of why different criteria are applied to different types of data: these methods / criteria are typically developed by different groups with different backgrounds, so differences in how they process data are to be expected. One must also take into account that the distribution of the measured data and the type of error can differ between, say, RNA-seq and methylation. Even for the same type of data, there are many variations on analysis methods, each invariably claiming its superiority over the others.

I would regard an RNA-seq analysis that includes just a simple differential expression analysis and, for example, a heatmap, as 'very basic'. Indeed, clinical variable correlations and the construction of regression models combining clinical, RNA-seq, and other data are routinely used by some in the field.

Even the elimination of genes based on low variance is employed by some, in some circumstances. For DEA of RNA-seq data, specifically, I would not remove genes based on low variance beforehand. The statistical models used by edgeR and DESeq2, for example, specifically aim to model and control for dispersion.

The way in which you prepare your data will depend on how you wish to perform the tumour deconvolution. The program / algorithm that you use may expect data that follows a normal 'bell curve' distribution, for example, whereas RNA-seq raw and normalised counts are typically modelled with a negative binomial distribution; you thus have to further transform these normalised counts in order to use them for most downstream applications.
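As a rough illustration of such a transformation, here is a minimal sketch that converts raw counts to log2(CPM + 1). This is an assumed, deliberately simple stand-in and not the method of any particular package; dedicated tools such as DESeq2 provide their own variance-stabilising transformations. The function name, the pseudocount of 1, and the toy counts are all hypothetical.

```python
import math


def log2_cpm(counts, pseudocount=1.0):
    """Return log2(CPM + pseudocount) for each gene and sample.

    `counts` maps gene name -> list of raw counts, one entry per sample.
    The pseudocount avoids taking the log of zero; its value is an assumption.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample (column sums).
    library_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [
            math.log2(1e6 * c / size + pseudocount)
            for c, size in zip(row, library_sizes)
        ]
        for gene, row in counts.items()
    }


# Toy data: 3 genes x 2 samples (made-up numbers, each sample sums to 1e6).
toy_counts = {
    "GENE_A": [998977, 999000],
    "GENE_X": [1023, 1000],
    "GENE_D": [0, 0],
}
logged = log2_cpm(toy_counts)
print(logged["GENE_X"], logged["GENE_D"])
```

The log compresses the long right tail of the counts, which usually brings the values closer to what methods assuming roughly normal data expect.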

A further transformation of the normalised, transformed counts to the Z scale may prove useful. Z-scores are quite intuitive and have much utility in the realm of deconvolution, to which I allude here (and elsewhere): Normalizing transcriptome data by tissue type

Kevin

ADD COMMENT • link 2.3 years ago by Kevin Blighe 89k

Hello Kevin,

Thank you for your answer.

I agree with you: for DEA in RNA-Seq, preprocessing that consists only of removing lowly expressed genes now seems to be a gold standard, and it is well suited to the normalization step that follows.

My concern is rather why the same person would preprocess RNA-Seq data for DEA as we mentioned, yet preprocess methylation data for DMR analysis with more extensive criteria. I imagine that there are people who process the two types of data in different ways. As you mentioned, this is probably because of the nature of the data: the ranges of beta values and normalized counts are completely different, as are their distributions, so it can make sense to use different approaches.

Good point, I will check what kind of distribution the tumor deconvolution methods expect. I do not know yet, as I have just started on this question. Could you recommend other types of transformations (in addition to Z-scores) to try?

Finally, could you please explain why you think Z-scores are particularly interesting in the field of deconvolution?

Thank you, Jane

ADD REPLY • link 6.5 years ago by jane.merlevede ▴ 20

Hey, a quick doubt entered my head: could you just clarify what you mean by 'tumour deconvolution'? In my mind, it means identifying different cell populations in the tumour bulk biopsy. There are different definitions, depending on to whom you talk.

If it is about the identification of cell populations, then another obvious approach is:

1. obtain 'pure' cell populations and process these by cDNA microarray or RNA-seq
2. construct predictive regression models that can identify each pure cell population with high sensitivity/specificity. As an example, pure B-cells would likely be easily identified by high expression of MS4A1 (CD20), CD79A, and CD79B
3. apply the model to your unknown test data in order to predict the cell populations in it
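The three steps above can be sketched roughly as follows. Note that this uses a nearest-centroid classifier as a simple stand-in for the regression models mentioned, and it assigns one label to a sample rather than estimating cell-type proportions in a mixture. The expression values are made up, and CD3E is added here as an assumed T-cell marker that was not part of the thread.

```python
import math
from statistics import fmean

# Marker genes, in the order used by every profile below.
# MS4A1, CD79A, CD79B come from the thread; CD3E is an assumption for the example.
MARKERS = ["MS4A1", "CD79A", "CD79B", "CD3E"]


def build_centroids(training):
    """Step 2: average the 'pure' profiles of each cell type.

    `training` maps label -> list of profiles (log-scale values ordered like MARKERS).
    """
    return {
        label: [fmean(column) for column in zip(*profiles)]
        for label, profiles in training.items()
    }


def predict(profile, centroids):
    """Step 3: return the label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(profile, centroids[label]))


# Step 1: toy 'pure population' profiles (made-up log-expression values).
pure_profiles = {
    "B_cell": [[9.0, 8.5, 8.0, 1.0], [9.5, 8.0, 8.5, 0.5]],
    "T_cell": [[0.5, 1.0, 0.5, 9.0], [1.0, 0.5, 1.0, 8.5]],
}
centroids = build_centroids(pure_profiles)

unknown_1 = [8.8, 8.2, 7.9, 1.2]
unknown_2 = [0.9, 1.1, 0.4, 9.2]
print(predict(unknown_1, centroids), predict(unknown_2, centroids))
```

A real application would need many more genes and samples, held-out data to estimate sensitivity/specificity, and a method that handles mixtures if proportions are the goal.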

Z-scores are quite intuitive because a score of +1 implies that the gene's expression is 1 standard deviation (sdev) above the mean expression value. +3 indicates expression 3 sdevs above the mean, and -1 is 1 sdev below the mean. As a practical example: you could take all genes with Z > +2 in a dataset in order to identify the genes most highly expressed.
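A minimal sketch of that idea, assuming Z-scores are computed per gene across samples on already transformed (log-scale) values. The gene names, the values, and the choice to give zero-variance genes a Z-score of 0 are assumptions for illustration only.

```python
from statistics import fmean, stdev


def z_scores(values):
    """Scale one gene's values to Z-scores: (value - mean) / sample sdev.

    Genes with zero variance get all-zero Z-scores (an assumed convention).
    """
    mu = fmean(values)
    sd = stdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


# Toy data: 3 genes x 8 samples of transformed expression (made-up numbers).
expression = {
    "GENE_UP": [5, 5, 5, 5, 5, 5, 5, 13],   # strongly elevated in the last sample
    "GENE_MID": [4, 5, 6, 5, 4, 6, 5, 5],   # mild variation only
    "GENE_FLAT": [3, 3, 3, 3, 3, 3, 3, 3],  # no variation at all
}
z = {gene: z_scores(values) for gene, values in expression.items()}

# Report every (gene, sample index) pair with Z > +2.
hits = [
    (gene, sample)
    for gene, scores in z.items()
    for sample, score in enumerate(scores)
    if score > 2
]
print(hits)
```

With very few samples a Z-score cannot reach +2 at all (the maximum is (n - 1) / sqrt(n) when using the sample sdev), so a cutoff like this only makes sense with a reasonable number of samples.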

ADD REPLY • link 6.5 years ago by Kevin Blighe 89k

Sorry, I just saw your answer. Yes I mean identification of cell populations in bulk by tumor deconvolution.

Using pure cell populations is an interesting approach, used in supervised methods such as CIBERSORT, EPIC, xCell, ... on gene expression data. I assume that with supervised approaches, the "cleaning of the dataset" might matter less than with unsupervised approaches. I can use both approaches in parallel, but my focus here is mainly on unsupervised methods. That is why I would like to start with a clean and meaningful matrix.

Z-scores are intuitive. What I cannot figure out is whether, and why, using them might improve the analysis itself, beyond making it easier to interpret.

ADD REPLY • link 6.4 years ago by jane.merlevede ▴ 20