Question

MA plots with similar distributions but different sets of significantly expressed genes

1

3.0 years ago

fufor94 ▴ 10

Hi all, I have two MA plots here. They display similar patterns, but the genes called significantly differentially expressed seem to differ. The strain on the left has fewer than 5 million reads, while the one on the right has 5 million or more. I am thinking of filtering out strains with fewer than 5 million reads, just because I somehow "believe" that if I had more reads for the one on the left, the probability of detecting more DEGs would be higher. I am hoping for other opinions on this matter; yours will be greatly appreciated. Thanks

MAplots

reads MAplot DEG rna DESeq2 • 1.4k views

updated 3.0 years ago by seidel 11k • written 3.0 years ago by fufor94 ▴ 10

0

I am just curious, why are you dividing the data into two plots? Can you explain a bit more about the experimental design?

3.0 years ago by Giovanni M Dall'Olio 28k

0

Hi Giovanni, In summary, I have over 100 strains of E. coli, and I am comparing their gene expression profiles with a reference strain (K-12). So my goal is to identify which genes are differentially expressed in these strains when compared to my ref strain (NT12001). The plots shown here are from different strains compared with the ref.

3.0 years ago by fufor94 ▴ 10

score 0 · Answer 1 · 2021-04-09

Forget about the number of reads unless you can prove to yourself otherwise. How many replicates do you have? I highly doubt the easy answer is sequencing depth; it more likely has to do with variance in the data.

I'm assuming your reference strain is in the denominator of each plot. On the right it looks like there is a set of genes expressed in the reference that are NOT expressed in the test strain, and reproducibly so. On the left, that same set of genes may be more highly variable in the test strain (disagreement between replicates). If this is the case, you won't get good p-values, and you won't be able to solve that problem with sequencing depth, especially given that these genes fall in the middle of your count distribution.

Have you examined a correlation plot of all the data sets against each other? Do your replicates on the right have high correlation and those on the left have lower correlation with each other?

As an experiment, you can test your hypothesis that depth will help by downsampling the data for the right-hand plot (3 million reads, 2 million reads, etc.). Your prediction is that as you lower the sequencing depth, it will turn into the plot on the left. I think you'll have to go really low before those DEGs on the right disappear.

Take a handful of genes from each plot, in the region you expect to be clearly DEG, and examine a bar plot of all the replicate values. I think you'll see the ones on the right are tight, and the ones on the left are highly variable (thus not passing a statistical cutoff).

And FWIW, these are HUGE ratios: the blue dots on the right are 30-1000 fold. Biologically this might be telling you something, if there is a set of genes expressed in your reference strain that is stochastically expressed in the other strains.
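
The downsampling experiment and the replicate correlation check could be sketched roughly as follows. This is only a minimal illustration in Python with NumPy, assuming you have a genes x samples matrix of raw counts. It thins the counts binomially, which approximates subsampling reads before alignment. The function names, the simulated count matrix and the depth targets are assumptions made up for the example, not anything taken from the original analysis.

```python
import numpy as np


def downsample_counts(counts, target_depth, rng):
    """Binomially thin a genes x samples count matrix.

    Each read is kept independently with probability target_depth / depth,
    so every column ends up with roughly target_depth reads. Columns that
    are already at or below the target are returned unchanged.
    """
    counts = np.asarray(counts, dtype=np.int64)
    depth = counts.sum(axis=0)                       # reads per sample
    keep_prob = np.minimum(1.0, target_depth / np.maximum(depth, 1))
    return rng.binomial(counts, keep_prob)           # broadcasts over columns


def replicate_correlation(counts):
    """Pearson correlation between samples on log2(count + 1) values."""
    logged = np.log2(np.asarray(counts, dtype=float) + 1.0)
    return np.corrcoef(logged, rowvar=False)         # samples x samples


# Simulated stand-in for real data (an assumption for illustration only):
# 2000 genes, 3 replicates, about 5 million reads per replicate.
rng = np.random.default_rng(0)
mean_expr = rng.lognormal(mean=0.0, sigma=1.5, size=2000)
mean_expr *= 5_000_000 / mean_expr.sum()
counts = rng.poisson(mean_expr[:, None], size=(2000, 3))

for target in (3_000_000, 2_000_000, 1_000_000):
    thinned = downsample_counts(counts, target, rng)
    corr = replicate_correlation(thinned)
    # Report the target depth, the achieved depth per replicate,
    # and the lowest pairwise replicate correlation.
    print(target, thinned.sum(axis=0), corr.min().round(3))
```

To actually test the depth hypothesis, each thinned matrix would be run through the same DESeq2 workflow as the original counts, and the number of DEGs compared across depths. The correlation matrix on the real (not thinned) counts is the quicker check: low correlation between replicates of the left-hand strain would point to variance rather than depth.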