Question

Normalization of NGS data generated from different platforms of microRNAs

4.3 years ago

skjobs ▴ 190

I have read counts for microRNAs from NGS data generated on different platforms (Illumina NextSeq 500 and HiSeq 2000). Do I need to normalize them before differential expression analysis? Which is the best method to normalize the data before DE analysis? My concern is that the data were generated on two different machines.

rna-seq sequencing next-gen R

updated 4.3 years ago by i.sudbery 19k • written 4.3 years ago by skjobs ▴ 190

I would first check by PCA whether there is an obvious batch effect. Typically, different Illumina platforms are very similar and do not influence the result dramatically. You can transform the data with vst from DESeq2 and then use plotPCA. If there is no evidence of a batch effect, perform a standard DE analysis as described in the manuals of established DE tools.

4.3 years ago by ATpoint 82k

score 2 · Answer 1 · 2020-01-05

It depends on which samples were sequenced on which machines. NextSeq 500 and HiSeq are similar technologies, so the effects may not be too strong. Hopefully you have samples from each of the conditions sequenced on each of the machines, in which case you can just include the machine as a covariate in your linear model.

So if you had a sample table that looked like:

Sample    Condition    Machine
1         Control      Hiseq
2         Control      Hiseq
3         Control      Nextseq
4         Control      Nextseq
5         Treatment    Hiseq
6         Treatment    Hiseq
7         Treatment    Nextseq
8         Treatment    Nextseq

Then you could use the design formula ~ Machine + Condition to correct for the effect of the different machines, the same as for any other batch effect. In an ideal world your design would be perfectly balanced, like in the example above, but you should get some benefit from this as long as your design isn't perfectly confounded (i.e. all the controls on one machine, all of the treatments on the other).
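To make the difference between a balanced and a confounded design concrete, here is a minimal sketch in plain Python (not R, and not what DESeq2 does internally). It builds the dummy-coded design matrix for an additive model with a machine term and a condition term and checks its rank. The helper names `design_matrix` and `matrix_rank` are made up for this illustration, and the confounded layout is a hypothetical variant of the table above.

```python
from fractions import Fraction


def design_matrix(conditions, machines):
    """Rows of [intercept, machine dummy, condition dummy].

    Assumes two levels per factor; the first value seen is the reference level.
    """
    ref_machine, ref_condition = machines[0], conditions[0]
    return [[1, int(m != ref_machine), int(c != ref_condition)]
            for c, m in zip(conditions, machines)]


def matrix_rank(matrix):
    """Rank by Gauss-Jordan elimination with exact fractions."""
    rows = [[Fraction(x) for x in r] for r in matrix]
    n_pivots = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(n_pivots, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[n_pivots], rows[pivot] = rows[pivot], rows[n_pivots]
        # Clear this column in every other row.
        for r in range(len(rows)):
            if r != n_pivots and rows[r][col] != 0:
                factor = rows[r][col] / rows[n_pivots][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[n_pivots])]
        n_pivots += 1
    return n_pivots


conditions = ["Control"] * 4 + ["Treatment"] * 4

# Balanced layout from the sample table: both machines in both conditions.
balanced_machines = ["Hiseq", "Hiseq", "Nextseq", "Nextseq"] * 2
# Hypothetical confounded layout: all controls on one machine, all treatments on the other.
confounded_machines = ["Hiseq"] * 4 + ["Nextseq"] * 4

print(matrix_rank(design_matrix(conditions, balanced_machines)))    # 3
print(matrix_rank(design_matrix(conditions, confounded_machines)))  # 2
```

Full column rank (3) means the machine and condition effects can be estimated separately. A rank of 2 means the two columns are identical, so no model can tell them apart.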

You can look for the effect of batch using dimensionality reduction, such as MDS or PCA. You'll be looking for whether the samples cluster by batch or by condition. MDS is the easiest, but it can be difficult to interpret. If you are going to use PCA, you'll want to do some sort of variance stabilization first, like DESeq2::rlog or DESeq2::vst, but you can then look for your batch effects in more than just two dimensions.
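As a rough illustration of what "cluster by batch or by condition" means numerically, the sketch below compares sample-to-sample distances, which is the quantity an MDS plot tries to represent. It is not MDS or PCA itself. The counts and miRNA names are invented toy values, and log2(CPM + 1) is only a crude stand-in for a proper variance-stabilizing transformation such as rlog or vst.

```python
import math
from itertools import combinations

# Samples 1-8 in the order of the sample table above.
machine = ["Hiseq", "Hiseq", "Nextseq", "Nextseq"] * 2
condition = ["Control"] * 4 + ["Treatment"] * 4

# Hypothetical toy counts (rows: miRNAs, columns: samples), invented for illustration.
counts = {
    "miR-a": [100, 110, 210, 190, 105, 95, 200, 205],   # shifts with machine
    "miR-b": [50, 55, 52, 48, 150, 160, 155, 145],      # shifts with condition
    "miR-c": [300, 290, 310, 305, 295, 300, 308, 292],  # roughly constant
}

n = len(machine)
lib_size = [sum(row[j] for row in counts.values()) for j in range(n)]


def log_cpm(j):
    """log2(counts per million + 1) profile of sample j (crude stand-in for vst/rlog)."""
    return [math.log2(row[j] / lib_size[j] * 1e6 + 1) for row in counts.values()]


profiles = [log_cpm(j) for j in range(n)]


def mean_distance(same_machine, same_condition):
    """Mean Euclidean distance over sample pairs matching the given pattern."""
    d = [math.dist(profiles[i], profiles[j])
         for i, j in combinations(range(n), 2)
         if (machine[i] == machine[j]) == same_machine
         and (condition[i] == condition[j]) == same_condition]
    return sum(d) / len(d)


replicates = mean_distance(True, True)       # same machine, same condition
machine_only = mean_distance(False, True)    # differ only in machine
condition_only = mean_distance(True, False)  # differ only in condition

print(f"replicates: {replicates:.2f}")
print(f"differ only in machine: {machine_only:.2f}")
print(f"differ only in condition: {condition_only:.2f}")
```

In this made-up data, replicates are closest, pairs that differ only in machine are further apart, and pairs that differ only in condition are furthest apart, so there is a machine effect but the condition dominates. If the machine-only distance were the largest, that would suggest samples cluster by batch instead. On real data you would look at the actual PCA or MDS plot rather than rely on a summary like this.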

If you do have perfect confounding, then there is not much you can do (consider an MDS plot: clustering by machine and clustering by condition are the same thing, so how could you, or any statistical method, tell the difference?). Given the similarity of the technologies, the results of this DE analysis might still be indicative in the absence of any correction, but I'd be nervous about basing any conclusion solely on this evidence.

Note that all of the above technically applies not just to analyses done on two different types of machine, but also to two different machines of the same type, or even two different lanes on the same machine.