Question

Need Help For The Study Design Of A Rna-Seq Project

10

12.3 years ago

shirley0818 ▴ 110

Dear All,

We are going to do RNA-Sequencing using Illumina HiSeq for 200 samples. Given that the sample size is fixed, and the budget is fixed, the following 3 options were proposed.

1. 50 bp paired-end reads, one sample per lane --> ~100 million reads per sample
2. 75 bp paired-end reads, two samples per lane --> ~50-60 million reads per sample
3. 100 bp paired-end reads, four samples per lane --> ~30-40 million reads per sample

Based on your experience, which option is best, or do you have other suggestions? We would like to run several kinds of analysis on these data, e.g., novel transcripts, lncRNA, splicing, SNPs, etc. You name it. If we had to sort them by priority (from high to low), I would say "novel transcripts, long non-coding RNAs, splicing, and differential expression".

Currently, the majority of labs sequence 100 bp paired-end, right? But I was told that even if you sequence 100 bp, the quality after 75 bp is very bad because of the sequencer itself, that is, it has nothing to do with the RNA quality of the samples. If this is true, why is the 100 bp read length becoming more popular now?

Many thanks, Shirley

rnaseq • 8.9k views

ADD COMMENT • link updated 12.2 years ago by NextGenSeek ▴ 290 • written 12.3 years ago by shirley0818 ▴ 110

1

Well, I can't give you the exact reasons, but I would prefer option 2, i.e., 75 bp and 50-60 million reads per sample.

ADD REPLY • link 12.3 years ago by Ashutosh Pandey 12k

1

If you are interested in differential expression, I would suggest running replicates for each sample. It will increase the cost, but maybe you could reduce the number of reads per replicate. E.g., with your first option, you could have ~35 million reads for each replicate (considering 3 replicates per sample).

ADD REPLY • link 12.3 years ago by Vikas Bansal ★ 2.4k

0

Vikas - are you referring to technical replicates or biological replicates? The latter is better, I expect, but I've heard a lot of banter on the social wires / blogs lately regarding "more replicates for RNA-Seq", and when probed further some seem to mean that more technical replicates (n>3) of the same biological replicate are needed. This reminds me of the lessons learned years ago with microarrays, but in that case biological replicates are always preferred. Would RNA-seq be any different?

Thoughts?

ADD REPLY • link 12.3 years ago by Jonathanjacobs ▴ 280

4

Hi Shirley and Jonathan,

If I were you I would maybe run 190 samples and use the leftover money to run some pilot data to answer that question.

We worked on the differential expression problem in our paper on power analysis for RNA-Seq (http://euler.bc.edu/marthlab/scotty/scotty.php), but that only focused on differential expression. We did not look at read length. It is an interesting question. For differential gene expression I would expect (without having done the analysis) that more, shorter reads will give more information, because the reads will align fairly uniquely even at 50 bases and you will detect more rare transcripts with lower counting noise. However, there comes a point in detecting differential expression where you have sequenced enough to quantify all of your genes pretty well (with ~10 reads), and sequencing the same sample deeper is a waste of money. That point varies by species and by sequencing protocol, so we recommend using pilot data to determine it. You can run an analysis through Scotty if you have the pilot data. Scotty expects replicates, but if you don't have replicates you can just run a rarefaction curve to see at what depth your samples reach 10 reads per gene. If you need help, email us.
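The rarefaction idea above can be sketched in a few lines of Python. This is only an illustration and not how Scotty works: the function names and the toy `pilot` counts are made up, and it assumes you already have a per-gene read count table from one pilot sample.

```python
import random

def genes_detected(gene_counts, depth, min_reads=10, seed=0):
    """Subsample `depth` reads without replacement and count genes with >= min_reads."""
    # Expand the count table into one entry per read. Fine for a sketch;
    # a real count table would need a more memory-friendly sampler.
    reads = [gene for gene, n in gene_counts.items() for _ in range(n)]
    rng = random.Random(seed)
    sample = rng.sample(reads, min(depth, len(reads)))
    tally = {}
    for gene in sample:
        tally[gene] = tally.get(gene, 0) + 1
    return sum(1 for n in tally.values() if n >= min_reads)

def rarefaction_curve(gene_counts, depths, min_reads=10, seed=0):
    """Return (depth, genes detected) pairs for each requested depth."""
    return [(d, genes_detected(gene_counts, d, min_reads, seed)) for d in depths]

# Toy pilot data (hypothetical): a few abundant genes and some rare ones.
pilot = {"geneA": 5000, "geneB": 2000, "geneC": 300, "geneD": 40, "geneE": 12}
for depth, n_genes in rarefaction_curve(pilot, [100, 1000, 7352]):
    print(depth, n_genes)
```

Where the curve flattens, extra depth on the same sample is no longer adding genes at the chosen threshold, which is the point at which the money is probably better spent elsewhere.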

Regarding replicates, as a general rule you get more statistical power for differential expression by dividing a fixed number of reads into as many biological replicates as possible. Think of a million biological replicates with one read each as the ideal way to spend a million reads. But then you add the cost of a million library preps.

Daniel is right, of course. Adding another technical replicate improves power by reducing your uncertainty about the true expression level, but it only reduces the uncertainty that is due to technical noise. Most of the noise is usually biological (unless you have difficult-to-sequence samples or other special conditions). So adding another biological replicate reduces both technical and biological uncertainty, and is generally more helpful. If you are doing something like 200 cell lines, it would be awesome to do at least 2 biological replicates of each cell line. No one ever does two replicates, and the data would be so much more useful because you would be less likely to mistake systematic biological noise for a real effect.
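One way to make the technical-versus-biological argument concrete is a simple nested variance model. The symbols here are assumptions introduced for illustration and do not come from the Scotty paper: n biological replicates, m technical replicates of each, biological variance sigma_b^2 and technical variance sigma_t^2. Under that model the variance of the estimated mean expression is:

```latex
\operatorname{Var}(\bar{x}) \;=\; \frac{\sigma_b^2}{n} \;+\; \frac{\sigma_t^2}{n\,m}
```

Increasing m only shrinks the second term, while increasing n shrinks both, which is why extra biological replicates are generally the better use of a fixed budget when sigma_b^2 dominates.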

The question of read count versus read length gets more complicated when you want to detect differential expression at the transcript level, novel transcripts, lncRNA, splicing, etc. In those cases you want to bridge junctions, and longer reads are better at that. Then the question becomes where the trade-off is between read length and read number. Insert size will play a role there too.

If you can, I would try to address all of this empirically with pilot data. Perhaps you can find a case where someone has made a big library with 100 bp paired-end reads, and try some assemblies (or whatever) with different permutations of the data. That is, try an assembly with a subset of the long reads and see how that goes, and then use more reads but trim them shorter and see how that goes. It would be better to run the experiment on your own data, with a library of each type, because there may be other artifacts in there and the answer may be species-specific. You could probably even get a small methods paper out of the work, and you have to set up your analysis pipeline anyway.
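A minimal sketch of the "fewer long reads versus more, trimmed reads" experiment, assuming plain uncompressed FASTQ with four lines per record. The function names and the toy records are hypothetical, and a dedicated read-processing tool would be the more robust choice for real data.

```python
import random

def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between records or at end of file
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line
        qual = next(it).rstrip("\n")
        yield header.rstrip("\n"), seq, qual

def subsample_and_trim(records, fraction, length, seed=0):
    """Keep roughly `fraction` of the records and trim each to `length` bases."""
    rng = random.Random(seed)
    for header, seq, qual in records:
        if rng.random() < fraction:
            yield header, seq[:length], qual[:length]

# Toy input (made up): two 12-base reads.
toy = [
    "@read1\n", "ACGTACGTACGT\n", "+\n", "IIIIIIIIIIII\n",
    "@read2\n", "TTTTGGGGCCCC\n", "+\n", "IIIIIIIIIIII\n",
]
for header, seq, qual in subsample_and_trim(parse_fastq(toy), fraction=1.0, length=5):
    print(header, seq, qual)
```

For paired-end data, running the same function with the same `seed` on both mate files keeps the same pairs, because one random number is drawn per record in file order.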

That's my opinion, anyway.

Good luck!

Michele

ADD REPLY • link 12.3 years ago by Michele Busby ★ 2.2k

1

Dear Michele,

We are working on 200 human samples collected from patients.

Thanks a lot for your detailed explanation and great suggestions. I will try Scotty and let you know if I need help :) I like your idea of either running some pilot data or finding an available library with 100 bp PE reads to answer my question. I am working on this now!

Thank you all for your great suggestions. I really appreciate it. Shirley

ADD REPLY • link 12.3 years ago by shirley0818 ▴ 110

0

Hi Shirley,

If Scotty chokes going up to that many samples let me know. I think with that many samples you will be able to detect very small fold changes if you don't have any batch effects or similar. I may need to do some reprogramming to get it to handle that many samples but I'm happy to do it.

Thanks, Michele

ADD REPLY • link 12.2 years ago by Michele Busby ★ 2.2k

1

RNA-Seq should have low technical variability. Biological replicates would be preferred. Reduce the chances of lane bias artifacts by making sure you index and split your samples across multiple lanes. Your laboratory handling of the samples is more likely to introduce bias.
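As a rough sketch of what "index and split across multiple lanes" could look like at a fixed lane budget: samples are shuffled, grouped into indexed pools, and each pool is run on several lanes, so that no sample depends on a single lane. The function name, the shuffle, and the pool sizes are assumptions for illustration; whether a pool of 16 is feasible depends on how many barcodes your kit supports.

```python
import math
import random

def lane_layout(samples, samples_per_lane, lanes_per_pool, seed=0):
    """Map lane number -> list of samples, keeping each sample's share of the run constant."""
    order = list(samples)
    # Shuffle so that pool membership is not confounded with collection order.
    random.Random(seed).shuffle(order)
    pool_size = samples_per_lane * lanes_per_pool
    layout = {}
    lane = 1
    for start in range(0, len(order), pool_size):
        pool = order[start:start + pool_size]
        # A short final pool gets proportionally fewer lanes.
        n_lanes = math.ceil(len(pool) / samples_per_lane)
        for _ in range(n_lanes):
            layout[lane] = pool
            lane += 1
    return layout

# 200 samples at a depth equivalent to four samples per lane, each pool over 4 lanes.
layout = lane_layout([f"S{i}" for i in range(200)], samples_per_lane=4, lanes_per_pool=4)
print(len(layout), "lanes;", len(layout[1]), "samples in lane 1")
```

With these numbers the layout still uses 50 lanes, the same as running four samples per lane, but each sample is read on more than one lane.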

ADD REPLY • link 12.3 years ago by User 59 13k

score 1 · Answer 1 · 2013-04-12

I have not observed substantial quality decay after 75 bases; the technologies are advancing and being upgraded all the time. I was recently surprised by just how good a current MiSeq run coming our way is. It has 250 bp reads with outstanding quality measures; moreover, it maps exceedingly well to the target genome, seemingly producing less than one sequencing error per five thousand bases. That seems better than the reported qualities.

I have come to believe that the RNA extraction and library preparation quality are probably more important and can introduce far larger effects than the inherent errors created by the sequencer.

As for your question, as Vikas Bansal suggests, replication is probably more important than any of your choices. Once that is covered properly, getting a sufficient number of reads to quantify expression levels is the next priority. I would also agree with ashutoshmits in picking a safe middle option that seems to balance both needs.

score 0 · Answer 2 · 2013-04-12

What is your target organism? Or just its approximate genome size, if you'd like to be secretive.

100bp being bad is certainly not true in general, as Istvan points out. It's much more a function of the library quality and perhaps how up-to-date your Illumina kits and protocols are.

I ask about genome size because the 50, 75, and 100-bp reads directly impact mappability and therefore the fraction of the genome available to you in these experiments. In a smaller genome 50 bp may be acceptable, but for large mammalian genomes you might want to go larger if you can. Simulation studies can address this if you're really curious.

After mapping the reads, in a molecular counting experiment like RNA-seq, each read pair counts as a vote from one molecule and counts the same regardless of its length (for most purposes). So from this point of view, it's always better to have more reads (more evidence, better statistics) than simply coverage (which is different from e.g. genomic resequencing).

For your more binary questions (novel transcripts, lncRNAs, splice site detection), one experiment per sample is good enough. For any differential expression questions, you certainly want biological replicates, as Vikas points out. If library preparation is limited, you could perhaps get away with replicating only a subset of the samples and assuming that the amount of variation is the same across all samples.

I assume you're aware of these papers, but they're very relevant examples of this type of study:

score 0 · Answer 3 · 2013-04-18

0

12.2 years ago

NextGenSeek ▴ 290

Another thing to consider is whether your fragment size is big enough. If the fragments are short, the two reads of a pair may overlap significantly. In that case, a shorter read length or even single-end reads may be worth considering.
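The overlap is easy to estimate before choosing a read length. This is a back-of-the-envelope sketch; the function names are made up, and `fragment_len` is assumed to be the insert length without adapters.

```python
def mate_overlap(fragment_len, read_len):
    """Bases covered by both mates of a pair (0 if the mates do not meet)."""
    return max(0, min(fragment_len, 2 * read_len - fragment_len))

def unique_bases(fragment_len, read_len):
    """Distinct fragment bases covered by the pair."""
    return min(fragment_len, 2 * read_len)

for fragment_len, read_len in [(150, 50), (150, 100), (300, 100), (80, 100)]:
    print(fragment_len, read_len,
          mate_overlap(fragment_len, read_len),
          unique_bases(fragment_len, read_len))
```

For a 150 bp fragment, 2 x 100 bp reads cover the same 150 bases that 2 x 75 bp reads would, so the extra cycles only add redundant sequence in the middle of the fragment.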

ADD COMMENT • link 12.2 years ago by NextGenSeek ▴ 290