I am working with a collaborator who is doing RNA-seq on biopsy samples of different infected tissues in the host. She is doing a pilot study comparing three tissues and is interested in differential expression in the parasite. As these are taken directly from the host, there is obviously a lot of host contamination. She has sequenced deeply (around 200 million reads per sample; for the same parasite in culture, 60 million is good) to make sure she gets enough parasite material to be useful. I have aligned her data against the host and parasite genomes together, and then counted reads aligning to parasite genes.
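To make the counting step concrete, it ends up looking something like the sketch below. The file name and gene-ID prefix are placeholders rather than our actual pipeline; the point is just that parasite counts are pulled out of a combined host+parasite count table.

```python
import pandas as pd

# Hypothetical count table from the combined host+parasite reference
# (a featureCounts-style matrix: gene_id plus one column per sample).
counts = pd.read_csv("combined_counts.tsv", sep="\t", index_col="gene_id")

# Placeholder: assume parasite gene IDs carry a recognisable prefix;
# substitute whatever distinguishes parasite genes in your annotation.
parasite = counts[counts.index.str.startswith("PARASITE_")]

# Parasite "library size" per sample - the quantity that spans
# ~3 orders of magnitude across the three tissues.
print(parasite.sum(axis=0))
```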
Of her three tissues, one is the canonical environment for this parasite. The other two are sites where parasites may sequester, leading to recrudescence after treatment. The parasite burden is very much reduced in the latter sites compared to the former. A consequence of this is that the total number of reads mapping to the parasite genome differs enormously across the three tissues: approximately three orders of magnitude between the extremes.
So my questions are:
- Can standard normalisation methods account for differences of this magnitude? Surely there must be a limit to what they can cope with? (A toy size-factor calculation follows this list.)
- If not, what is the best way to cope with this in the pilot data we already have? Downsampling? (A thinning sketch also follows.)
- What is the best way to cope with this in future samples? Carry on as we did in the pilot, or adjust the input material to try to get more even parasite coverage?
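On the first question, here is a minimal sketch of a DESeq-style median-of-ratios size factor calculation on an invented parasite count matrix with our kind of depth imbalance (all numbers are toy values, not her data). My reading of it is that the depth difference itself is not the obstacle; the obstacle is that in the shallow sample most genes have zero counts, so the size factors are estimated from only the sliver of genes observed in every sample, and the shallow sample's factor is biased by that filter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy parasite counts: 5000 genes x 3 tissues sharing one expression
# profile, with per-tissue depth factors spanning 3 orders of magnitude.
expr = rng.lognormal(mean=3.0, sigma=1.0, size=(5000, 1))
counts = rng.poisson(expr * np.array([1.0, 0.05, 0.001]))

# Median-of-ratios, restricted (as usual) to genes with nonzero
# counts in every sample.
usable = np.all(counts > 0, axis=1)
log_usable = np.log(counts[usable].astype(float))
ref = log_usable.mean(axis=1)                      # log geometric mean per gene
size_factors = np.exp(np.median(log_usable - ref[:, None], axis=0))

# The factors come out roughly proportional to the true depths
# 1 : 0.05 : 0.001, so normalisation "works" on paper...
print(size_factors)
# ...but only a small fraction of genes pass the all-nonzero filter,
# and the shallow sample's factor leans on its luckiest genes.
print(usable.sum(), "of 5000 genes usable")
```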
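And on the downsampling idea: one cheap way to explore it, without re-aligning subsampled FASTQs, is to thin the count matrix directly, keeping each counted read with probability target/total. This is only a sketch under that assumption; subsampling the BAMs (e.g. `samtools view -s`) would be the more rigorous route.

```python
import numpy as np

def thin_counts(counts, target, rng):
    """Binomial thinning: keep each of a gene's reads independently
    with probability target/total, approximating a shallower library."""
    total = counts.sum()
    if total <= target:
        return counts.copy()
    return rng.binomial(counts, target / total)

# Toy example: three samples with wildly different parasite depths,
# all thinned down to the shallowest sample's total.
rng = np.random.default_rng(7)
samples = {
    "canonical": rng.poisson(30.0, 5000),   # deep, well-covered tissue
    "site_b":    rng.poisson(1.5, 5000),
    "site_c":    rng.poisson(0.03, 5000),   # ~3 orders of magnitude down
}
floor = min(s.sum() for s in samples.values())
thinned = {name: thin_counts(c, floor, rng) for name, c in samples.items()}
print({name: int(c.sum()) for name, c in thinned.items()})
```

The obvious cost is that everything gets cut down to the shallowest sample's information content, which is partly why I am unsure it is the right move.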
Any thoughts appreciated!