Question

Joining RNA-seq libraries from different experiments

8.2 years ago

Joel TM ▴ 60

Good day to all, I am looking for insights about how to approach my issue. I am sure some of you have gone through this step at some point or perhaps there are related posts here I couldn't find that you know about.

I have libraries from lung tumor samples with 100M-200M reads each. I want to test for differential expression against normal/healthy lung samples. I found a public RNA-seq dataset for the latter, but its libraries have only 15M-20M reads each.

Would that kind of analysis/comparison be reliable at all? If so, what is the best way to approach this?

Thank you for the mentorship,

Regards,
Joel

RNA-Seq differential-expression Normalization • 1.7k views

Ram · Answer 1 · 2016-01-24

As Goutham states, if it's just a library size issue, most sequencing normalisation methods account for that (see sizeFactors in the DESeq2 manual for more information). However, it's rarely that simple when combining data from different experiments: there are often differences in chemistry, sample prep, instrument, day of the week, temperature in the room, etc., which add variation commonly known as 'batch effects'.
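To make the library size point concrete, here is a minimal Python sketch of the median-of-ratios idea behind DESeq2's size factors. It is not DESeq2's own code: the function name `size_factors` and the toy counts are made up for illustration, and it assumes every sample lists the same genes in the same order.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors (illustrative sketch, not DESeq2 itself).

    counts: list of samples, each a list of raw per-gene counts
            with the same gene order in every sample.
    """
    n_genes = len(counts[0])
    # Keep only genes with a non-zero count in every sample so the log is defined.
    usable = [g for g in range(n_genes) if all(s[g] > 0 for s in counts)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    # Log of the geometric mean of each usable gene across samples.
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in counts) / len(counts) for g in usable
    }
    # Each sample's factor is the median ratio of its counts to that reference.
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in usable))
        for s in counts
    ]

# Toy data (hypothetical): the same expression profile at 10x different depth.
deep = [1000, 2000, 500, 4000]
shallow = [100, 200, 50, 400]
sf = size_factors([deep, shallow])          # roughly 3.16 and 0.32
normalised = [[c / f for c in s] for s, f in zip([deep, shallow], sf)]
```

After dividing by the size factors, the two toy samples end up with the same values, which is all that depth normalisation promises. It does nothing about differences in chemistry or sample prep.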

Provided that your samples across experiments are of the same type, i.e. experiment A has healthy / tumour samples and experiment B also has healthy / tumour samples, you can account for that variation using an additive model. Even if A or B contains only tumour or only healthy samples, you could still block by experiment and estimate the variance, though not as reliably.
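Before fitting anything, it can help to tabulate which conditions appear in which experiment. The sketch below is a simple check for the two-condition case, with hypothetical function names and sample labels: if no experiment contains both healthy and tumour samples, the condition effect cannot be separated from the experiment effect in an additive model.

```python
from collections import defaultdict

def conditions_per_batch(samples):
    """samples: list of (batch, condition) pairs, one per library."""
    table = defaultdict(set)
    for batch, condition in samples:
        table[batch].add(condition)
    return dict(table)

def condition_is_separable(samples):
    """True if at least one batch contains more than one condition.

    For a two-condition comparison this is what an additive
    batch + condition model needs to tell the two effects apart.
    """
    return any(len(c) > 1 for c in conditions_per_batch(samples).values())

# Hypothetical layout: all tumours in experiment A, all healthy in experiment B.
confounded = [("A", "tumour"), ("A", "tumour"), ("B", "healthy"), ("B", "healthy")]
# Hypothetical layout: A has both types, B has only healthy samples.
partly_crossed = [("A", "tumour"), ("A", "healthy"), ("B", "healthy"), ("B", "healthy")]

print(condition_is_separable(confounded))      # False
print(condition_is_separable(partly_crossed))  # True
```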

Basically, this all comes down to how you design your model. I'd recommend the DESeq2 workflow: align with your favourite splice-aware aligner, count with htseq-count or Rsubread, then follow the DESeq2 vignette.

score 1 · Answer 2 · 2016-01-24

Normalisation methods account for the variation in read depth (library size) and work pretty well up to about a 10-fold difference. But you can always check the clustering or PCA plots to get an idea of how the samples look. I am not sure about other artefacts like batch effects.
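As a rough stand-in for a clustering plot, the Python sketch below scales each library to counts per million, log-transforms it, and reports each sample's nearest neighbour by Euclidean distance. The function names, sample names and counts are all hypothetical, and a real analysis would use the full gene set with a proper PCA or hierarchical clustering.

```python
import math

def log_cpm(sample):
    """log2(counts per million + 1) for one library of raw counts."""
    total = sum(sample)
    return [math.log2(c / total * 1e6 + 1) for c in sample]

def distance(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbours(samples):
    """samples: dict of sample name -> raw counts (same gene order).

    Returns a dict mapping each sample to the name of its closest other sample.
    """
    profiles = {name: log_cpm(counts) for name, counts in samples.items()}
    return {
        name: min(
            (other for other in profiles if other != name),
            key=lambda other: distance(profiles[name], profiles[other]),
        )
        for name in profiles
    }

# Toy counts (hypothetical): deep tumour libraries, shallow healthy libraries.
samples = {
    "tumour_1": [1000, 10, 500, 200],
    "tumour_2": [2000, 25, 1000, 380],
    "healthy_1": [10, 100, 50, 20],
    "healthy_2": [12, 90, 60, 25],
}
for name, neighbour in nearest_neighbours(samples).items():
    print(name, "is closest to", neighbour)
```

If samples group by experiment, and experiment and condition are the same split as in the question, this check cannot tell biology from batch. It only shows that the groups differ.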

To put it another way, if you are concerned only about different library sizes, normalisation methods will take care of that.