Question

Combining Illumina GA and HiSeq MicroRNA Sequencing Data-Sets

11.1 years ago

alpha2zee ▴ 120

I have a set of ~100 samples with microRNA sequencing data obtained using Illumina Genome Analyzer, and another set of ~200 samples with the data obtained using Illumina HiSeq 2000. The total ~300 samples belong to two groups, equally represented in the two data-sets. I am interested in differential expression analysis to compare microRNA expression between the two groups.

I want to combine the GA and HiSeq data (available as either absolute read counts or counts per million reads) to have a larger sample size for the analyses.

The GA and HiSeq 2000 platforms use the same 'chemistry', and I understand that the main difference between them is that the latter has a higher throughput (output per unit of processing time), so combining the data obtained with the two different platforms seems reasonable.

Can anyone advise if this is indeed so? Further,

(1) Should one use the absolute read count values, or the count per million values?

(2) Should I normalize the data after combining the data-sets? What method will be appropriate?

(3) How should missing values be dealt with? E.g., unlike in the HiSeq data-set, microRNA miR-X may not have been detected in any sample of the GA data-set (and is thus missing from it).

Thank you.

illumina sequencing normalization rnaseq • 5.7k views

score 2 · Answer 1 · 2013-03-16

To be maximally cautious, I would test the hypothesis that the GA and HiSeq counts are equivalent before combining datasets. You could do this empirically by making negative control comparisons of samples in the same group but across sequencing platforms.

You could do this in a more principled way by using a model that allows for batch effects, where the batch here is the sequencer used. edgeR can do this nicely. For more statistical details, see their paper and this one. I think DESeq can do this kind of multi-factor modeling as well, but I haven't personally used it for that task. There are also related questions on Biostars, e.g. "Gene-Level Analysis Of Rna-Seq Matched Pairs Of Samples?".
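As a rough illustration of the batch-effect model described above, here is a minimal edgeR sketch. The data are simulated, and the names `counts`, `platform` and `group` are placeholders rather than anything from the original data-sets. The additive design (`~ platform + group`) is one common way to adjust for the sequencer; it assumes the platform effect is the same in both study groups.

```r
library(edgeR)

# Simulated stand-in data: 50 microRNAs x 12 samples (hypothetical, for illustration only)
set.seed(1)
counts <- matrix(rnbinom(50 * 12, mu = 100, size = 5), nrow = 50,
                 dimnames = list(paste0("miR-", 1:50), paste0("S", 1:12)))
platform <- factor(rep(c("GA", "HiSeq"), each = 6))   # batch: sequencer used
group    <- factor(rep(c("A", "B"), times = 6))       # study group, balanced across platforms

y <- DGEList(counts = counts)                         # raw counts, not CPM
y <- calcNormFactors(y)                               # TMM normalization factors

# Additive model: 'platform' absorbs the batch effect, 'group' is the effect being tested
design <- model.matrix(~ platform + group)
y   <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = "groupB")                   # group B vs A, adjusted for platform
topTags(lrt)
```

Because both groups are represented on both platforms, the platform and group effects can be estimated separately; if one group had been sequenced only on one platform, the two would be confounded.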

For your specific questions:

(1) Typically it's better to use a probabilistic model that directly accounts for the discrete (count) nature of the data, in which case using raw counts is preferred. See the papers I mentioned for more discussion.

(2) This is an area of active research. Both edgeR and DESeq have methods for this, which should be adequate in most situations (better than RPKM).

(3) I'm not sure what you mean here. Counts of 0 should be handled perfectly fine by using a good count-based model, like edgeR and DESeq do.
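To make the point about zero counts concrete, here is one possible way to merge two count tables so that a microRNA absent from one platform's table gets a count of 0. This is only a sketch: the toy matrices and the helper `expandRows()` are hypothetical, and it assumes that absence from a table really means no reads were observed, rather than a difference in annotation between the two data-sets.

```r
# Hypothetical toy count matrices (rows = microRNAs, columns = samples)
GA    <- matrix(c(10, 0, 5,  12, 3, 7), nrow = 3,
                dimnames = list(c("miR-1", "miR-2", "miR-3"), c("GA_1", "GA_2")))
HiSeq <- matrix(c(100, 20, 40, 8,  90, 25, 35, 0), nrow = 4,
                dimnames = list(c("miR-1", "miR-2", "miR-3", "miR-X"), c("HS_1", "HS_2")))

allMirs <- union(rownames(GA), rownames(HiSeq))       # every microRNA seen on either platform

# Hypothetical helper: expand a matrix to the full row set, filling absent microRNAs with 0
expandRows <- function(m, ids) {
  out <- matrix(0, nrow = length(ids), ncol = ncol(m),
                dimnames = list(ids, colnames(m)))
  out[rownames(m), ] <- m
  out
}

combined <- cbind(expandRows(GA, allMirs), expandRows(HiSeq, allMirs))
platform <- factor(rep(c("GA", "HiSeq"), c(ncol(GA), ncol(HiSeq))))  # batch label per sample
```

The resulting `combined` matrix of raw counts, together with `platform`, could then be passed to a count-based model that includes the platform as a factor.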

score 0 · Answer 2 · 2013-03-23

Various analyses that I recently performed suggest that the data can indeed be combined.

As I mentioned, I have two different data-sets ('GA' and 'HiSeq'). To assess the feasibility of combining the data, I start with data-sets of absolute count values. Within each data-set, the count values are normalized by trimmed mean of M-values (TMM) normalization, using the calcNormFactors() function of the edgeR package for R, to obtain two new data-sets with count-per-million values. The new data-sets can then be compared after retaining only the common variables (rows; microRNAs). Comparisons that can be tried are inter-platform (GA vs HiSeq) XY-plots of mean or median values for the microRNAs for the same study group, and multi-dimensional scaling plots (or similar, like principal components). E.g.,

library(edgeR) # provides DGEList(), calcNormFactors() and cpm()
y <- DGEList(counts=HiSeq) # absolute counts
y <- calcNormFactors(y) # default TMM normalization
HiSeqNormed <- cpm(y, normalized.lib.sizes=TRUE) # get normalized count-per-million values; similarly, get GANormed
temp <- intersect(rownames(HiSeqNormed), rownames(GANormed)) # common variables/microRNAs
# XY-plot of per-microRNA means for one study group; HiSeqSamples and GASamples select that group's columns
plot(log2(rowMeans(HiSeqNormed[temp, HiSeqSamples])), log2(rowMeans(GANormed[temp, GASamples])), xlab='HiSeq', ylab='GA', main='log2(mean) microRNA counts-per-million')
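The principal-components comparison mentioned above could look something like the following. This is a sketch on simulated counts, not the actual analysis: `counts` and `platform` are placeholder names, and unlike the per-data-set normalization described above, the samples are normalized together here. If the samples separate by platform along the first components, that would suggest a batch effect worth modelling.

```r
library(edgeR)

# Simulated stand-in for combined counts (hypothetical; 40 microRNAs x 8 samples)
set.seed(2)
counts <- matrix(rnbinom(40 * 8, mu = 200, size = 5), nrow = 40,
                 dimnames = list(paste0("miR-", 1:40), paste0("S", 1:8)))
platform <- factor(rep(c("GA", "HiSeq"), each = 4))

y <- calcNormFactors(DGEList(counts = counts))        # TMM normalization factors
logCPM <- cpm(y, log = TRUE, prior.count = 1)         # log2 CPM; prior count avoids log2(0)

pca <- prcomp(t(logCPM))                              # samples as rows, microRNAs as variables
plot(pca$x[, 1], pca$x[, 2], col = as.integer(platform), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Samples coloured by platform")
legend("topright", legend = levels(platform),
       col = seq_along(levels(platform)), pch = 19)
```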