Question

Downstream analysis with DESeq2 normalization

0

5 months ago

QX • 0

Hi All,

I am trying to integrate an RNA-seq dataset with a proteomics dataset. I used DESeq2 to normalize the RNA-seq data. I have two questions about the next steps:

1) I aim to compute z-scores for the RNA-seq data. I understand that DESeq2 normalization does not take gene length into account, which means that longer genes could drive the distribution of normalized counts. For this reason, the normalized data may not be valid for further analysis. Is this correct? If so, is there a step for dealing with gene length? Should I apply an additional FPKM-style normalization to the DESeq2-normalized data?
2) For integration with proteomics, should I (a) normalize the two datasets first and then work on the overlapping genes/proteins, or (b) take the overlap of the two datasets first and then normalize only that subset of genes/proteins? I think approach (b) makes little sense for normalization, because I would throw out many genes and proteins that do not overlap, which may affect the original distribution of the data.

Can anybody share some thoughts? :)

Cheers,

DEseq2 integration • 771 views

1

1) A Z-score subtracts each gene's mean count (and scales by its standard deviation), so it does not matter how long a gene is.

2) Yes, normalize based on all genes, then subset to the genes/proteins that have any counts/intensities in both datasets, because only for these can you make statements. Absence of evidence/detection is not evidence of absence/non-expression.

It will anyway come down to some sort of correlation-like analysis.
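
As a rough illustration of the "normalize first, subset afterwards, then correlate" idea above, here is a minimal Python sketch. The gene IDs and values are made up, both tables are assumed to be normalized already, and `pearson` is a small helper written for this example rather than a function from any package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical, already-normalized values: gene ID -> one value per sample.
rna = {
    "GENE_A": [10.0, 20.0, 30.0],
    "GENE_B": [5.0, 5.5, 4.0],
    "GENE_C": [0.0, 1.0, 2.0],   # RNA only, dropped at the subsetting step
}
protein = {
    "GENE_A": [1.0, 2.1, 2.9],
    "GENE_B": [7.0, 6.0, 9.0],
    "GENE_D": [3.0, 3.0, 4.0],   # protein only, dropped at the subsetting step
}

# Subset only after normalization: keep IDs measured in both datasets.
shared = sorted(set(rna) & set(protein))

# Per-gene correlation of the RNA and protein profiles across samples.
correlations = {g: pearson(rna[g], protein[g]) for g in shared}
```

With real data the same pattern applies per gene across many samples; a gene with a constant profile would need special handling, since its variance is zero.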

5 months ago by ATpoint 82k

0

Hi ATpoint, thank you again for your thoughts.

For 1), I think your answer only works if the z-score is computed across samples for the same gene, right? If I want to compare the expression of genes within a sample, do I then need to correct for gene length?

2) Yes, I agree. However, I still wonder how we can know the effect of the genes and proteins that we removed (the ones that do not overlap). It is possible that these genes and proteins have high counts or protein abundance in the cell and could provide some biological insight, but they are removed because they were not detected by the other technique.

For the correlation-like analysis, do you know of any metric other than Pearson correlation that I could use? Are there other approaches to omics integration, besides correlation-like analysis, that you can think of?

5 months ago by QX • 0

2

If by Z score you mean, for gene g in sample n, with counts counts[g,n]:

Z[g,n] = (counts[g,n] - mean(counts[g,]))/sd(counts[g,])

then gene length is not important, because it is a constant factor for a given gene and cancels out. However, if by Z score you mean

Z[g,n] = (counts[g,n] - mean(counts[,n]))/sd(counts[,n])

then length normalisation is important. Furthermore, I'm not entirely sure how valid DESeq2 normalisation is in this case (although I don't know what a better one would be).
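
To make the difference between the two definitions concrete, here is a small Python sketch with made-up numbers. It assumes counts are simply a per-length expression value multiplied by gene length, which is a toy model and not what DESeq2 does; `zscores` and `by_sample` are helpers written for this example. A per-gene constant such as length cancels in the gene-wise Z score but not in the sample-wise one.

```python
from statistics import mean, stdev

def zscores(values):
    """Z-score a list of numbers using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical toy data: rows are genes, columns are samples.
expression = [[1.0, 2.0, 4.0],
              [3.0, 1.0, 2.0],
              [2.0, 2.0, 5.0]]   # expression per unit of gene length
lengths = [1.0, 10.0, 2.0]       # made-up gene lengths

# Toy model: counts[g][n] = expression[g][n] * lengths[g]
counts = [[e * length for e in row] for row, length in zip(expression, lengths)]

# Gene-wise: mean and sd are taken over samples for each gene.
z_gene_counts = [zscores(row) for row in counts]
z_gene_expr = [zscores(row) for row in expression]

def by_sample(matrix):
    """Sample-wise: mean and sd are taken over genes within each sample."""
    columns = [zscores(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]  # back to genes x samples

z_sample_counts = by_sample(counts)
z_sample_expr = by_sample(expression)
```

In this toy model `z_gene_counts` matches `z_gene_expr` up to floating-point error, while `z_sample_counts` and `z_sample_expr` differ, because the long second gene dominates every sample.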

I've not come across a better correlation metric than Pearson's (and not for want of looking).

5 months ago by i.sudbery 19k

0

Hi @i.sudbery,

Thanks for the clarification. I think the same about z-scores for count data.

I found a paper by Saad Haider et al. on another approach to integration, but I think correlation analysis is still the better place to start.

5 months ago by QX • 0