Question

Dropping treatments with low alignment from differential expression analyses?

0

6.3 years ago

nsl24 • 0

I plan to perform differential expression analyses (using rDiff and DESeq2) on a data set containing three treatments, each with three replicates.

Two of the treatments aligned very well (>87% aligned unambiguously in STAR). One treatment, however, has a contaminant (whose identity I know) which makes up a majority of reads for that treatment. Only between 11% and 58% of the reads from the contaminated treatment align to my species of interest.

My gut says just to drop the contaminated treatment (which I can do and still have something to work with) because I have fewer reads from that treatment going in. Reading about how programs like DESeq2 work, however, makes me think that maybe I could still include the contaminated treatment in the differential expression analysis.

Can any of you kind people provide any advice on best practices/what is "okay" here?

RNA-Seq alignment • 1.3k views

ADD COMMENT • link updated 6.3 years ago by WouterDeCoster 47k • written 6.3 years ago by nsl24 • 0

score 1 · Answer 1 · 2018-01-13

1

6.3 years ago

WouterDeCoster 47k

You only gave us the percentages, and not the total number of reads that do align. In general, software like DESeq2 can handle differences in library size (normalization corrects for that). So if your "good" groups have 20M reads and your "badly aligned" group only 10M, this shouldn't be too much of a problem.

Of course, if your "good" groups have 10M reads and your "badly aligned" group only 0.5M, then lowly abundant genes will not be detected in the badly aligned group and results will be biased towards more abundant genes.
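To make the depth effect concrete, here is a rough sketch that assumes read counts for a gene follow a simple Poisson model. That is a simplification: real RNA-Seq counts are overdispersed, which is why DESeq2 uses a negative binomial model. The gene abundance of one read per million aligned reads is a hypothetical value chosen for illustration, and the function names are not from any library.

```python
import math

def expected_count(reads_aligned: float, fraction_of_library: float) -> float:
    """Expected number of reads for a gene that makes up a given fraction of the library."""
    return reads_aligned * fraction_of_library

def prob_zero_reads(reads_aligned: float, fraction_of_library: float) -> float:
    """Probability of observing no reads at all for the gene, under a Poisson assumption."""
    return math.exp(-expected_count(reads_aligned, fraction_of_library))

# Hypothetical lowly abundant gene: one read per million aligned reads.
gene_fraction = 1e-6

for label, depth in [("good (10M)", 10_000_000), ("badly aligned (0.5M)", 500_000)]:
    mu = expected_count(depth, gene_fraction)
    p0 = prob_zero_reads(depth, gene_fraction)
    print(f"{label}: expected reads = {mu:.1f}, P(zero reads) = {p0:.3f}")
```

Under these assumptions the gene is expected to get about 10 reads in a 10M library but only about 0.5 reads in a 0.5M library, where it would be missed entirely in roughly 61% of libraries.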

ADD COMMENT • link 6.3 years ago by WouterDeCoster 47k

1

Be careful to bear in mind that if you choose to down-sample, you will basically be dumping lots of good data from the 2 good samples just to make the numbers equal with the bad sample. If the gene/locus you are interested in is commonly expressed, then a low read number will be sufficient. But, and I think this is the key, if you are trying to identify changes in genes that are rarely expressed, then you need all the data you can get, and in that case re-sequencing the sample without the contaminant would be more cost effective.

ADD REPLY • link 6.3 years ago by BioinfGuru ★ 1.7k

0

Thanks for the warning, Kenneth.

The "bad" treatment was an exploratory addition to the other two treatments (which are really my focus) and isn't really necessary; I'm just trying to see if it COULD work in its current form. Culturing the lines for the third treatment was difficult, getting rid of the contaminant would be really tricky, and since I don't need it I would probably just stick to dropping it if I need to.

I appreciate the feedback very much.

ADD REPLY • link 6.3 years ago by nsl24 • 0

0

I didn't say downsampling. Modest differences in library size are corrected for during normalization. That's not downsampling.
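As a sketch of what normalization means here: DESeq2 documents a median-of-ratios approach for its size factors, which scales each library instead of discarding reads. The snippet below only illustrates that idea in plain Python; it is not DESeq2's code, and the function name and toy counts are made up for the example.

```python
import math
from statistics import median

def size_factors(counts: dict[str, list[int]]) -> dict[str, float]:
    """Median-of-ratios size factors.

    counts maps sample name -> raw counts per gene (same gene order in every sample).
    Genes with a zero count in any sample are skipped, because their geometric
    mean is zero. Raises statistics.StatisticsError if no gene is usable.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    log_ratios: dict[str, list[float]] = {s: [] for s in samples}
    for g in range(n_genes):
        gene_counts = [counts[s][g] for s in samples]
        if min(gene_counts) == 0:
            continue
        # Log of the geometric mean of this gene across all samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / len(gene_counts)
        for s in samples:
            log_ratios[s].append(math.log(counts[s][g]) - log_geo_mean)
    # Each sample's size factor is the median ratio to the per-gene geometric mean.
    return {s: math.exp(median(log_ratios[s])) for s in samples}

# Toy data (hypothetical): the second library was sequenced at half the depth.
toy = {
    "good": [100, 200, 400, 50],
    "half_depth": [50, 100, 200, 25],
}
factors = size_factors(toy)
normalized = {s: [c / factors[s] for c in toy[s]] for s in toy}
print(factors)
print(normalized)
```

In this toy case the two size factors differ by a factor of two, and dividing by them puts both libraries on the same scale while keeping every read from the deeper library.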

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k

0

Yes, to avoid confusion: downsampling would apply in the situation you described where there is a vast difference in library size.

ADD REPLY • link 6.3 years ago by BioinfGuru ★ 1.7k

0

Thanks for the reply, and sorry for leaving a gap in my description.

My good treatments range between 10M and 12M reads aligning, while my bad treatment ranges between 1.2M and 6M aligning. Does that change your perspective?

Do you have any suggestions for how to become better at making these sorts of calls? As I said, my gut pushed me to take the cautious route of dropping the treatment. Maybe it's just that this is only my second project and I haven't had to troubleshoot too much, but despite reading quite a bit I still hesitate when it comes to making these judgement calls.

ADD REPLY • link 6.3 years ago by nsl24 • 0

1

We recently had a situation where, of 2 samples, 1 aligned poorly. After some investigation it turned out that the sample had been contaminated by the sequencing company. The original sample was fine and some was still left, so we could re-sequence at no extra cost, a bit more carefully this time. I get the impression your sample was contaminated during the experiment. I'd be reluctant to use it, but then I won't understand the biology as well as you will.

ADD REPLY • link 6.3 years ago by BioinfGuru ★ 1.7k

0

It doesn't hurt to analyse twice, with and without the bad libraries. But, as noted by Kenneth, we don't know about the biology of the contamination.
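One simple way to compare the two runs is to look at how much the lists of significant genes overlap for the contrast between the two well-aligned treatments. The sketch below assumes you have already exported the significant gene IDs from each run; the function name and gene IDs are placeholders, not real results.

```python
def compare_calls(with_bad: set[str], without_bad: set[str]) -> dict:
    """Summarise agreement between two sets of significant gene IDs."""
    shared = with_bad & without_bad
    union = with_bad | without_bad
    return {
        "shared": sorted(shared),
        "only_with_bad": sorted(with_bad - without_bad),
        "only_without_bad": sorted(without_bad - with_bad),
        # Jaccard index: 1.0 means identical calls, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }

# Placeholder gene IDs for illustration only.
sig_with_bad_libraries = {"geneA", "geneB", "geneC"}
sig_without_bad_libraries = {"geneB", "geneC", "geneD"}

summary = compare_calls(sig_with_bad_libraries, sig_without_bad_libraries)
print(summary)
```

If the overlap is high, including the contaminated libraries probably changes little for the main comparison; genes that appear in only one run are the ones worth inspecting by hand.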

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k