Question: Dropping treatments with low alignment from differential expression analyses?
8 weeks ago, nsl24 wrote:

I plan to perform differential expression analyses (using rDiff and DESeq2) on a data set containing three treatments, each with three replicates.

Two of the treatments aligned very well (>87% of reads aligned unambiguously in STAR). The third treatment, however, is dominated by a contaminant (whose identity I know). Only between 11% and 58% of the reads from the contaminated treatment align to my species of interest.

My gut says to just drop the contaminated treatment (which I can do and still have something to work with), because I have fewer reads from that treatment going in. Reading about how programs like DESeq2 work, however, makes me think I might still be able to include the contaminated treatment in the differential expression analysis.

Can any of you kind people provide any advice on best practices/what is "okay" here?

rna-seq alignment
modified 8 weeks ago by WouterDeCoster • written 8 weeks ago by nsl24
8 weeks ago, WouterDeCoster wrote:

You only gave us the percentages, not the total numbers of reads that do align. In general, software like DESeq2 can handle differences in library size (normalization corrects for that). So if your "good" groups have 20M reads and your "badly aligned" group only 10M, this shouldn't be too much of a problem.

Of course, if your "good" groups have 10M reads and your "badly aligned" group only 0.5M, then lowly abundant genes will not be detected in the badly aligned group and results will be biased toward more abundant genes.

written 8 weeks ago by WouterDeCoster
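To make the normalization WouterDeCoster describes concrete: DESeq2 computes median-of-ratios size factors. A minimal numpy sketch of the idea (this is a simplified illustration, not DESeq2's actual implementation, and the toy count matrix is invented):

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors (the idea behind DESeq2's
    estimateSizeFactors), for a genes x samples matrix of raw counts."""
    # build the pseudo-reference from genes observed in every sample
    expressed = (counts > 0).all(axis=1)
    logs = np.log(counts[expressed].astype(float))
    log_geo_mean = logs.mean(axis=1)           # log geometric mean per gene
    # each sample's factor is its median ratio to the pseudo-reference
    return np.exp(np.median(logs - log_geo_mean[:, None], axis=0))

# toy matrix: sample 2 is sequenced twice as deep as sample 1, sample 3 half as deep
counts = np.array([[100, 200, 50],
                   [ 30,  60, 15],
                   [  8,  16,  4]])
sf = size_factors(counts)         # -> approximately [1.0, 2.0, 0.5]
normalized = counts / sf          # counts are now comparable across libraries
```

Dividing by the size factors removes the pure depth difference, which is why a 2x gap in library size is not a problem; it cannot, however, recover genes that received zero reads in a shallow library.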

Really bear in mind that if you choose to do this down-sampling, you will basically be dumping lots of good data from the two good samples just to make the numbers equal to the bad sample. Now, if the gene/locus you are interested in is commonly expressed, then a low read number will be sufficient. But, and I think this is the key, if you are trying to identify changes in genes that are expressed rarely, then you need all the data you can get, and in that case the samples without the contaminant would be more cost effective.

modified 8 weeks ago • written 8 weeks ago by YaGalbi
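The cost of down-sampling that YaGalbi warns about can be simulated by binomial thinning of per-gene counts. A minimal sketch (the Poisson-generated "library" is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def downsample(counts, target_total):
    """Randomly thin per-gene read counts to roughly target_total reads,
    keeping each read independently with probability target/total."""
    p = target_total / counts.sum()
    return rng.binomial(counts, p)

# a "good" library: ~50k reads spread over 1000 genes (invented example)
good = rng.poisson(lam=50, size=1000)

# thin it 10x to match a shallow, contaminated library
thinned = downsample(good, good.sum() // 10)

# rarely expressed genes tend to drop to zero entirely after thinning,
# which is exactly the bias toward abundant genes described in the answer
```

Thinning preserves relative abundances in expectation, but low-count genes are the first to vanish, which is why matching everything to the worst library discards the most informative part of the good data.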

Thanks for the warning, Kenneth.

The "bad" treatment was an exploratory addition to the other two treatments (which are really my focus) and isn't strictly necessary; I'm just trying to see if it COULD work in its current form. Culturing the lines for the third treatment was difficult, getting rid of the contaminant would be really tricky, and since I don't need it I would probably just stick to dropping it.

I appreciate the feedback very much.

written 8 weeks ago by nsl24

I didn't say downsampling. Modest differences in library size are corrected for during normalization. That's not downsampling.

written 8 weeks ago by WouterDeCoster

Yes, to avoid confusion: downsampling would apply in the situation you described, where there is a vast difference in library size.

written 8 weeks ago by YaGalbi

Thanks for the reply, and sorry for the gap in my description.

My good treatments range between 10M and 12M reads aligning, while my bad treatment ranges between 1.2M and 6M aligning. Does that change your perspective?

Do you have any suggestions for how to become better at making these sorts of calls? As I said, my gut pushed me to take the cautious route of dropping the treatment. Maybe it's just that this is only my second project and I haven't had to troubleshoot too much, but despite reading quite a bit I still hesitate when it comes to making these judgement calls.

written 8 weeks ago by nsl24

We recently had a situation where, of two samples, one aligned poorly. After some investigation it turned out the sample had been contaminated by the sequencing company. The original sample was fine and some material was still left, so we could re-sequence at no extra cost, a bit more carefully this time. I get the impression your sample was contaminated during the experiment. I'd be reluctant to use it, but then I won't understand the biology as well as you will.

modified 8 weeks ago • written 8 weeks ago by YaGalbi

It doesn't hurt to analyse twice, with and without the bad libraries. But as noted by Kenneth we don't know about the biology of the contamination.

written 8 weeks ago by WouterDeCoster
Powered by Biostar version 2.3.0