Question

When identifying DEGs, should all the libraries be normalized together?

23 months ago

Estefania ▴ 30

Hello, I am writing an RNA-seq manuscript with a fellow lab member. I did everything except the bioinformatics. We have seven different tissues and three biological replicates for each, so a total of 21 libraries. We are comparing four of the tissues for a specific analysis. For this analysis, she normalized the counts of those 12 libraries (excluding the 9 others) and then identified DEGs. I suggested that she normalize all 21 libraries together and then identify DEGs by running contrasts only among the libraries in the specific analysis (the 12 libraries in this case).

She did both analyses and found that when comparing, for example, tissue A to tissue B, 1,221 genes (79%) were identified as upregulated by both normalization methods, 257 (16.6%) were identified only by subset-sample-based normalization (i.e., only the 12 libraries), and 67 (4.3%) were identified only by all-sample-based normalization. She argues that "by normalizing with the samples/libraries of interest, we were able to identify more DEGs, which covered 94.5% of the DEGs identified by all-sample-based normalization. So, if we report subset-sample-based normalization results, we are missing 5% DEGs that can be identified otherwise; if we report all-sample-based normalization results, we are missing 17% DEGs that would be reported otherwise."
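For anyone who wants to check the overlap arithmetic, here is a small sketch that recomputes the shares from the three counts given above. The variable names are mine. From these counts, the coverage of the all-sample list works out to about 94.8% rather than the quoted 94.5%, so one of the reported figures may be slightly rounded or come from a different contrast. Note that these percentages only describe how much the two lists agree; they say nothing about which list is more correct.

```python
# Counts of upregulated genes (tissue A vs. tissue B) as reported in the post.
both = 1221         # called under both normalizations
subset_only = 257   # called only with the 12-library normalization
all_only = 67       # called only with the 21-library normalization

union = both + subset_only + all_only   # every gene called by either approach
subset_total = both + subset_only       # full list from the 12-library run
all_total = both + all_only             # full list from the 21-library run


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


shares = {
    "both": pct(both, union),
    "subset_only": pct(subset_only, union),
    "all_only": pct(all_only, union),
    "all_list_covered_by_subset": pct(both, all_total),
    "subset_list_covered_by_all": pct(both, subset_total),
}

for name, value in shares.items():
    print(f"{name}: {value}%")
```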

I think that the DEGs identified only by subset-sample-based normalization are likely not real DEGs because they are likely expressed in multiple tissues. The samples we included are not all the tissues in the plant we used, but the more tissues the more complete the picture, correct?

DEG Normalization • 950 views

ADD COMMENT • link updated 23 months ago by swbarnes2 14k • written 23 months ago by Estefania ▴ 30

Using the number of DEGs as proof of method efficacy is not necessarily a good idea. For example, you get fewer DEGs with log fold change shrinkage, but that is because shrinkage reduces the bias toward inflated log fold changes in genes with low counts.

When estimating the variance of a gene, one component of that variance is calculated using all samples, so including more samples often gives more accurate variance estimates. You could be getting fewer DEGs, for example, because the more accurate variance estimates are removing false positives.
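To make the "more samples, more information about the variance" point concrete, here is a deliberately simplified sketch. It uses an ordinary pooled within-group variance on made-up log-scale values for a single gene, not the negative binomial dispersion estimation with cross-gene shrinkage that count-based DE tools typically use, so treat it as an analogy only. The tissue names and numbers are hypothetical. The thing to look at is the residual degrees of freedom: 8 with four tissues, 14 with seven. The pooling also assumes the gene is equally variable in every tissue, which may not hold when the tissues are very different.

```python
from statistics import variance


def pooled_variance(groups):
    """Pooled within-group variance and its residual degrees of freedom.

    groups: dict mapping a tissue name to the replicate values of one gene.
    Assumes the within-tissue variance is the same in every tissue.
    """
    df = sum(len(values) - 1 for values in groups.values())
    ss = sum((len(values) - 1) * variance(values) for values in groups.values())
    return ss / df, df


# Hypothetical log2 expression values for ONE gene, three replicates per tissue.
all_tissues = {
    "tissue_A": [5.1, 5.4, 4.9],
    "tissue_B": [7.0, 6.8, 7.3],
    "tissue_C": [5.0, 5.2, 5.3],
    "tissue_D": [6.1, 5.9, 6.4],
    "tissue_E": [4.2, 4.6, 4.4],
    "tissue_F": [8.0, 7.7, 8.2],
    "tissue_G": [5.6, 5.5, 5.9],
}
subset = {name: all_tissues[name] for name in ("tissue_A", "tissue_B", "tissue_C", "tissue_D")}

var_subset, df_subset = pooled_variance(subset)
var_all, df_all = pooled_variance(all_tissues)

print(f"12 libraries: variance estimate {var_subset:.4f} on {df_subset} residual df")
print(f"21 libraries: variance estimate {var_all:.4f} on {df_all} residual df")
```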

The ideal workflow would be to use all samples in the model, apply log fold change shrinkage when working with the DEGs, and set a fold change threshold when making the contrast. This will generally give the highest-quality results. The exception would be samples from vastly different sources, where the counts are expected to be very different for most genes.

As a side note, if you haven't done so, run a PCA just to make sure the samples are separating as expected. It is also possible that outlying samples are reducing your power to detect DEGs with the full dataset.

ADD REPLY • link 23 months ago by rpolicastro 13k

Thank you; this is very helpful.

We did a PCA and the samples separated as expected by tissue.

ADD REPLY • link 23 months ago by Estefania ▴ 30

score 1 · Answer 1 · 2022-05-12

23 months ago

swbarnes2 14k

In general, you should include all the samples, but if some of them are wildly different, then the base assumptions of the normalization algorithm might be violated.

Normalizing totally different tissues to each other might not be appropriate.

"I think that the DEGs identified only by subset-sample-based normalization are likely not real DEGs because they are likely expressed in multiple tissues."

Why would a gene not be a DE gene if it's expressed in multiple tissues?

ADD COMMENT • link 23 months ago by swbarnes2 14k

I guess DE is not the right descriptor. A gene expressed in multiple tissues can still be DE if its expression is significantly different between the two tissues we are comparing, correct? I guess "not tissue-specific" is the more appropriate description. There probably aren't many tissue-specific genes.

We are working with flowers, fruits, pollen, and leaves. There are three developmental stages among the flowers and two among the fruits. The pollen and leaf samples are drastically different from the flowers and fruits. For part of the analysis, we want to look at genes that are DE only among flowers and fruits. For that analysis, is it then not recommended to include pollen and leaf, since they are so different? (Would you mind reminding me of the assumptions of the normalization algorithm?)

Thanks!

ADD REPLY • link 23 months ago by Estefania ▴ 30

I think the argument that "finding more DE genes means that her way is better" is not that sound. It could be that some of those extra genes are false positives.

But I'm not at all confident that adding totally different tissues is going to help to get more accurate library size normalization or more accurate dispersion estimates. More is not always better.
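On the question of the normalization assumptions: the thread doesn't say which tool was used, so as an illustration only, here is a minimal sketch of median-of-ratios size factors (the approach used by DESeq2-style normalization). Its key assumption is that most genes are not differentially expressed between samples, so the median ratio to a per-gene reference reflects sequencing depth rather than biology. The sample names and counts below are invented toy data with only five genes, where the pollen library violates that assumption badly. The sketch shows that adding such a sample can change the size factors of the other libraries relative to each other, which is why the two normalizations can produce different DEG lists.

```python
from math import exp, log
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of raw counts (same gene order).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples would be zero.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    usable = [g for g in range(n_genes) if all(counts[s][g] > 0 for s in samples)]
    # Per-gene reference: log of the geometric mean across the included samples.
    log_ref = {g: sum(log(counts[s][g]) for s in samples) / len(samples) for g in usable}
    return {
        s: exp(median(log(counts[s][g]) - log_ref[g] for g in usable))
        for s in samples
    }


# Hypothetical toy counts: five genes, three similar libraries and one very different one.
toy = {
    "flower_1": [100, 200, 50, 400, 80],
    "flower_2": [110, 190, 55, 420, 75],
    "fruit_1": [90, 210, 60, 380, 85],
    "pollen_1": [5, 1000, 2, 3000, 1],
}
subset = {s: toy[s] for s in ("flower_1", "flower_2", "fruit_1")}

sf_subset = size_factors(subset)
sf_all = size_factors(toy)

# Size factor of flower_1 relative to flower_2 under each normalization.
rel_subset = sf_subset["flower_1"] / sf_subset["flower_2"]
rel_all = sf_all["flower_1"] / sf_all["flower_2"]
print(f"flower_1 / flower_2, subset only:  {rel_subset:.3f}")
print(f"flower_1 / flower_2, with pollen:  {rel_all:.3f}")
```

With real data and thousands of genes the shift is usually much smaller than in this toy case, but the direction of the concern is the same: the more the added tissues differ, the less the "most genes are unchanged" assumption holds.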

ADD REPLY • link 23 months ago by swbarnes2 14k