Question

How To Merge/Compare RNA-Seq Data Generated Using Different Protocols

10.2 years ago

shirley0818 ▴ 110

Hello,

I have RNA-Seq data generated on an Illumina HT2000 for 14 samples from the same tissue. By accident, the technician applied the same protocol to all samples except for the final sequencing step.

Batch 1 (8 samples) using 50bp paired-end

Batch 2 (6 samples) using 75bp paired-end

Since I have to analyze these samples together, what is the best way to avoid batch effects or protocol differences, and to head off reviewers' criticism?

Method 1: align them separately, then merge them into one matrix using RPKM/FPKM for each gene and sample.

Method 2: for Batch 2 samples, use only the first 50bp of each read for alignment, ...

Method 3: re-sequence Batch 1 with the 75bp paired-end protocol using the leftover library.

Many thanks, Shirley

rna-seq • 5.2k views


score 1 · Answer 1 · 2014-03-25

If you are doing differential expression, have a look at the DESeq2 vignette, section 1.5 "Multi-factor designs", where they combine single-end and paired-end reads.
Method 2, with checks: error rate profile, correlation matrix and clustering, PCA, etc., to at least check for batch effects. If the library preparation was done in a single batch and only the sequencing differs, and the sequencing was of good quality in all cases (i.e. no drops in certain cycles), the batch effect should be negligible, because Illumina sequencing is highly reproducible. Some argue that even sample prep can be distributed among sequencing centers (http://www.nature.com/nbt/journal/v31/n11/full/nbt.2702.html).
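The correlation check mentioned above could be sketched roughly as follows. This is a minimal, standard-library-only illustration, not part of any package: the function names (`pearson`, `batch_correlation_summary`) and the input layout (one list of expression values per sample, same gene order) are assumptions made for the example. It compares the average sample-to-sample correlation within batches against the average between batches; a clearly lower between-batch value would hint at a batch effect.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists.

    Assumes both lists have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def batch_correlation_summary(expr, batch):
    """Return (mean within-batch r, mean between-batch r).

    expr:  dict of sample name -> list of expression values (hypothetical layout)
    batch: dict of sample name -> batch label
    """
    # log2(x + 1) dampens the influence of a few very highly expressed genes
    logged = {s: [math.log2(v + 1) for v in vals] for s, vals in expr.items()}
    within, between = [], []
    for a, b in combinations(sorted(logged), 2):
        r = pearson(logged[a], logged[b])
        if batch[a] == batch[b]:
            within.append(r)
        else:
            between.append(r)

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(within), mean(between)
```

If the two averages are close, the sequencing difference is probably not dominating the data; clustering or PCA on the same matrix would be the natural next check.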

score 0 · Answer 2 · 2014-03-25

If the library preps are the same, and the only difference is from the sequencing, then your cDNA and flowcell clusters should be the same. That is, the only difference is batch2 has more data, which you could trim off to make equivalent datasets.
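Trimming the 75bp reads down to 50bp can be sketched as below. This is only an illustration under stated assumptions: the helper name `trim_fastq_records` is made up for the example, and it assumes plain four-line FASTQ records (no wrapped sequence lines). For paired-end data the same trimming would be applied to both mate files.

```python
def trim_fastq_records(lines, length=50):
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes each record is exactly four lines:
    header, sequence, '+' separator, quality string.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield seq[:length]   # keep only the first `length` bases
            yield plus
            yield qual[:length]  # quality string must stay the same length as seq
            record = []
```

A typical use would be to open the input FASTQ, pass the file handle to `trim_fastq_records`, and write each yielded line followed by a newline to a new file. Established read trimmers do the same job and handle compressed input.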

Longer sequencing will give more accurate mapping, so if you have the money, redoing Batch1 at 75bp would be beneficial. (or do everything at 100bp for that matter)

Alternatively, your idea to just calculate FPKM should be equivalent, up to the mismapped shorter reads. In essence, the 50bp reads carry some mapping biases that the 75bp reads reduce, so trimming the longer reads back does not make the data better.
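For reference, FPKM for a gene is the number of fragments mapped to it, scaled per kilobase of gene length and per million mapped fragments, which is why it does not depend directly on read length. A minimal sketch follows; the function name `fpkm` and the dict-based inputs are assumptions for illustration, and the total is taken here as the sum of the supplied counts, whereas real tools may define the total differently.

```python
def fpkm(fragment_counts, gene_lengths_bp):
    """Compute FPKM per gene.

    fragment_counts: dict of gene -> number of mapped fragments
    gene_lengths_bp: dict of gene -> gene (or transcript) length in bp

    FPKM = count * 1e9 / (length_bp * total_fragments)
    """
    # Assumption: total mapped fragments = sum over the genes provided
    total = sum(fragment_counts.values())
    if total == 0:
        raise ValueError("no mapped fragments")
    return {
        gene: count * 1e9 / (gene_lengths_bp[gene] * total)
        for gene, count in fragment_counts.items()
    }
```

Computing this per sample and joining the results on gene name gives the single gene-by-sample matrix described in Method 1.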

If Batch1 and Batch2 are all the same condition, then pooling the results should be fine. But if, for example, one is control and the other is test, then you need the biases to be equivalent, and trimming the reads of Batch2 is a good idea. Perhaps you should compute differential expression between the batches and see which genes come out (hopefully very few, and none after p-value adjustment).

This all hinges on the assumption the library preps are equivalent. If one was size selected for 100bp fragments, and the other for 150bp fragments, of course you will have selected different mRNAs.

You might want to try a tool called RSEM, which calculates things like FPKM by mapping the reads to a transcriptome reference in ungapped fashion. This could further reduce sources of error from the differential alignment of the read batches. A tool like TopHat works on the genome, and will find more places to put a 50bp read than a 75bp read.

score 0 · Answer 3 · 2014-03-25

10.2 years ago

shirley0818 ▴ 110

Thanks for your detailed reply. Besides these two batches, we will have more samples in the near future, and they will need to be compared with these two batches. So, to be consistent and avoid potential reviewer criticism, we might redo Batch 1 at 75bp as you suggested.

Best, Shirley
