DESeq2 sizeFactor for Bacterial RNASeq
11 days ago
iXuan • 0

Hi Biostars community,

I am new to bioinformatics and am learning how to analyze bacterial RNA-seq data (E. coli specifically) for differential gene expression. After searching and reading on the forum, I still have some questions about normalization using DESeq2. I am using data generated by others in the lab. Each condition includes 3 replicates. The RNA-seq data has a large percentage of rRNA contamination despite rRNA depletion. The input to DESeq2 is the raw counts mapped to CDS (not including reads mapped to rRNA) from all conditions. My questions are:

  1. The total raw counts vary quite a bit between samples, ranging from 0.06 M to 9 M, as shown in the boxplot of raw counts. Is this variation too great for normalization and comparison for DGE?

[Image: boxplot of raw counts]

  2. Due to the large variation between samples, the sizeFactors are also all over the place, ranging from 0.02 to 9.04. My understanding is that sizeFactors should be close to 1. What is an acceptable range for sizeFactors?

  3. After normalization, the sums of normalized counts are closer to each other across samples than the sums of raw counts, but I am still not comfortable with the level of difference: some are as low as 1.43 M, while others are as high as 5.9 M. Can I even use these normalized counts for downstream DGE analysis?

  4. The PCA plot shows that the replicates of amp_6h cluster together, but the replicates of s_0h, untreated_3h, and amp_3h are scattered across PC1 or PC2. Is there any other processing I should try to make the data more interpretable prior to DGE analysis (such as removing the purple dot in the PCA that is far away from the other two purple dots)?

[Image: PCA plot]

  5. What are the possible causes of this variation? Moving forward, what should we take into consideration when prepping samples for RNASeq to minimize within-sample variation?

Thank you so much for helping me clarify!

normalization bacteria rRNA RNASeq • 175 views
10 days ago

You could try downsampling to 650000 coding reads per sample.

Do that a few times, see how much variation you get in the PCA.
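To make the downsampling idea concrete, here is a minimal sketch in plain Python. It assumes each sample's counts are held as a gene-to-count dictionary; the function name `downsample_counts` and the toy gene names are made up for illustration, and this is not a DESeq2 function. Reads are drawn without replacement, so no gene can end up with more reads than it started with.

```python
import random
from collections import Counter


def downsample_counts(counts, target, seed=None):
    """Randomly keep `target` reads from one sample, without replacement.

    counts: dict mapping gene ID -> integer raw read count (hypothetical layout)
    target: total number of reads to keep, e.g. 650000
    seed:   optional seed so a given draw can be reproduced
    """
    total = sum(counts.values())
    if target > total:
        raise ValueError(
            f"sample has only {total} reads, cannot downsample to {target}"
        )
    rng = random.Random(seed)
    genes = list(counts)
    # The counts= argument (Python 3.9+) treats each gene as appearing
    # counts[gene] times, so this is a draw of individual reads.
    drawn = rng.sample(genes, k=target, counts=[counts[g] for g in genes])
    picked = Counter(drawn)
    return {g: picked.get(g, 0) for g in genes}


# Toy example with made-up numbers
toy = {"geneA": 500, "geneB": 300, "geneC": 0, "geneD": 200}
sub = downsample_counts(toy, 100, seed=1)
print(sum(sub.values()), sub)
```

Running this with several different seeds per sample, then redoing the PCA on each downsampled matrix, gives a rough picture of how much of the scatter is just sampling noise from shallow libraries.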

I'm concerned that either the stationary samples really have so little coding RNA that it will violate normalization assumptions to include them, or that they were prepped as a separate batch from the rest, making them totally useless.

The 6 hour samples elicit the same worries, to a lesser degree.
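For context on the normalization assumption: DESeq2's default size factors follow a median-of-ratios idea, where each sample is compared gene by gene against a per-gene geometric mean across all samples and the median ratio is taken as the size factor. That only reflects sequencing depth if most genes are not changing between samples. Below is a rough sketch of the idea in plain Python, not the package's actual code; the data layout and the name `size_factors` are assumptions for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch only).

    counts: dict mapping sample name -> list of raw counts, with the same
            gene order in every sample (hypothetical layout)
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Per-gene mean of log counts (log of the geometric mean).
    # Genes with a zero in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        vals = [counts[s][g] for s in samples]
        if all(v > 0 for v in vals):
            log_geo_means.append(sum(math.log(v) for v in vals) / len(vals))
        else:
            log_geo_means.append(None)

    if all(m is None for m in log_geo_means):
        raise ValueError("every gene has a zero count in at least one sample")

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - m
            for g, m in enumerate(log_geo_means)
            if m is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Toy example: sample B is sample A sequenced exactly 10x deeper
toy = {"A": [10, 20, 30, 40], "B": [100, 200, 300, 400]}
sf = size_factors(toy)
print(sf["B"] / sf["A"])  # ratio of the two size factors, about 10
```

In this sketch the size factors are far from 1 simply because the depths differ, which on its own is expected. The trouble comes when a group of samples has a globally different composition of coding RNA, because then the median ratio no longer measures depth alone.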

