I am working with a collaborator who is doing RNA-seq on biopsy samples of different infected tissues in the host. She is doing a pilot study comparing three tissues and is interested in differential expression in the parasite. As these are taken directly from the host, there is obviously a lot of host contamination. She has sequenced deeply (around 200 million reads per sample; for the same parasite in culture, 60 million is good) to make sure she gets enough parasite material to be useful. I have aligned her data against the host and parasite genomes together, and then counted reads aligning to parasite genes.
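To make the counting step concrete, it ends up looking something like the sketch below. The file name and gene-ID prefix are placeholders rather than our actual pipeline; the point is just that parasite counts are pulled out of a combined host+parasite count table.

```python
import pandas as pd

# Hypothetical count table from the combined host+parasite reference
# (a featureCounts-style matrix: gene_id plus one column per sample).
counts = pd.read_csv("combined_counts.tsv", sep="\t", index_col="gene_id")

# Placeholder: assume parasite gene IDs carry a recognisable prefix;
# substitute whatever distinguishes parasite genes in your annotation.
parasite = counts[counts.index.str.startswith("PARASITE_")]

# Parasite "library size" per sample - the quantity that spans
# ~3 orders of magnitude across the three tissues.
print(parasite.sum(axis=0))
```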
Of her three tissues, one is the canonical environment for this parasite. The other two are sites where parasites may sequester, leading to recrudescence after treatment. The parasite burden is very much reduced in the latter sites compared to the former. A consequence of this is that the total number of reads mapping to the parasite genome differs enormously across the three tissues: approximately three orders of magnitude between the extremes.
So my questions are:
- Can standard normalisation methods account for differences of this magnitude? Surely there must be a limit to what they can cope with? (A toy size-factor calculation follows this list.)
- If not, what is the best way to cope with this in the pilot data we already have? Downsampling? (A thinning sketch also follows.)
- What is the best way to cope with this in future samples? Carry on as we did in the pilot, or adjust the input material to try to get more even parasite coverage?
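On the first question, here is a minimal sketch of a DESeq-style median-of-ratios size factor calculation on an invented parasite count matrix with our kind of depth imbalance (all numbers are toy values, not her data). My reading of it is that the depth difference itself is not the obstacle; the obstacle is that in the shallow sample most genes have zero counts, so the size factors are estimated from only the sliver of genes observed in every sample, and the shallow sample's factor is biased by that filter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy parasite counts: 5000 genes x 3 tissues sharing one expression
# profile, with per-tissue depth factors spanning 3 orders of magnitude.
expr = rng.lognormal(mean=3.0, sigma=1.0, size=(5000, 1))
counts = rng.poisson(expr * np.array([1.0, 0.05, 0.001]))

# Median-of-ratios, restricted (as usual) to genes with nonzero
# counts in every sample.
usable = np.all(counts > 0, axis=1)
log_usable = np.log(counts[usable].astype(float))
ref = log_usable.mean(axis=1)                      # log geometric mean per gene
size_factors = np.exp(np.median(log_usable - ref[:, None], axis=0))

# The factors come out roughly proportional to the true depths
# 1 : 0.05 : 0.001, so normalisation "works" on paper...
print(size_factors)
# ...but only a small fraction of genes pass the all-nonzero filter,
# and the shallow sample's factor leans on its luckiest genes.
print(usable.sum(), "of 5000 genes usable")
```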
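And on the downsampling idea: one cheap way to explore it, without re-aligning subsampled FASTQs, is to thin the count matrix directly, keeping each counted read with probability target/total. This is only a sketch under that assumption; subsampling the BAMs (e.g. `samtools view -s`) would be the more rigorous route.

```python
import numpy as np

def thin_counts(counts, target, rng):
    """Binomial thinning: keep each of a gene's reads independently
    with probability target/total, approximating a shallower library."""
    total = counts.sum()
    if total <= target:
        return counts.copy()
    return rng.binomial(counts, target / total)

# Toy example: three samples with wildly different parasite depths,
# all thinned down to the shallowest sample's total.
rng = np.random.default_rng(7)
samples = {
    "canonical": rng.poisson(30.0, 5000),   # deep, well-covered tissue
    "site_b":    rng.poisson(1.5, 5000),
    "site_c":    rng.poisson(0.03, 5000),   # ~3 orders of magnitude down
}
floor = min(s.sum() for s in samples.values())
thinned = {name: thin_counts(c, floor, rng) for name, c in samples.items()}
print({name: int(c.sum()) for name, c in thinned.items()})
```

The obvious cost is that everything gets cut down to the shallowest sample's information content, which is partly why I am unsure it is the right move.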
Any thoughts appreciated!