Question

50PE vs 100PE reads to assemble a de novo transcriptome

1

8.4 years ago

Alicia ▴ 20

Dear community,

I am designing a differential gene expression experiment in a non-model animal without a reference genome and I need your expertise.

My experiment has 2 conditions at 2 locations, with 3 replicates per condition. It is also a time-course experiment with 4 time points. Total number of samples = ((2x3)x2)x4 = 48.

My initial idea was to multiplex 12 samples per ILMN HiSeq lane at 50PE.

Since I don't have a reference, I have to generate my own de novo transcriptome, and this is where I need your help. I don't know whether 50PE will be enough to generate a good reference transcriptome; maybe it is better to go for 100PE on at least 2 lanes to get better coverage.

What do you think? Of course, I have budget restrictions so I cannot sequence my 32 samples at 100PE.

Thanks for your help!

RNA-Seq gene-expression de-novo transcriptome • 2.5k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by Alicia ▴ 20

Ram · Answer 1 · 2015-11-26

1

8.4 years ago

Chris Fields ★ 2.2k

Not sure the math is right here: 2 (conditions) x 2 (locations) x 4 (time points) x 3 (replicates) = 48 samples, not 32. Did you mean 2 replicates?
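The count is easy to double-check by enumerating the design. The snippet below is only a sketch: the factor labels (`cond_A`, `loc_1`, `t1`, and so on) are made-up placeholders, and the 12 samples per lane figure is taken from the question.

```python
from itertools import product
from math import ceil

# Placeholder labels; only the number of levels per factor comes from the post.
conditions = ["cond_A", "cond_B"]
locations = ["loc_1", "loc_2"]
time_points = ["t1", "t2", "t3", "t4"]
replicates = [1, 2, 3]

# One tuple per library to be sequenced.
samples = list(product(conditions, locations, time_points, replicates))
n_samples = len(samples)

# Multiplexing level proposed in the question.
samples_per_lane = 12
lanes_needed = ceil(n_samples / samples_per_lane)

# The same design with 2 replicates instead of 3.
n_samples_two_reps = len(conditions) * len(locations) * len(time_points) * 2

print(f"3 replicates: {n_samples} samples, {lanes_needed} lanes at {samples_per_lane} per lane")
print(f"2 replicates: {n_samples_two_reps} samples")
```

With 3 replicates the design gives 48 libraries (4 lanes at 12 per lane); 32 only comes out if there are 2 replicates.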

If at all possible use strand-specific 100nt PE reads, and generate your assembly off your experimental data. Pretty much everything is strand-specific these days (cost is the same, at least here at our facility) but it doesn't hurt to check and make sure.

Also, maybe I'm misunderstanding what you are trying to do, but just in case: regardless of whether you choose 50 or 100nt PE reads, don't mix read lengths (e.g. some samples at 50nt PE, others at 100nt PE). Why complicate an already complex experimental design with another potential confounding factor (variable read length)? I could see steps that depend on read length (alignment, read counting, etc.) causing batch effects if read length varies drastically between samples.

Re: @seta's answer, I have found that reference assemblies (ones made from multiple tissues, developmental stages, etc) are very useful for annotation and gene model building, but using them directly for gene expression analysis can be misleading. If you go this route I highly suggest making sure you aren't losing information by tracking read fates, in particular assessing how many reads don't map to the reference assembly (and thus may represent transcripts unique to your experimental conditions). I have personally seen several instances where <60% of reads map to a broad-based reference assembly, but >90% map back to a self-generated assembly, suggesting that information can be lost.
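As a rough illustration of tracking read fates, the sketch below compares the mapping rate of the same library against two assemblies. The function names and the read counts are hypothetical, chosen only to mirror the <60% vs >90% pattern described above; in practice the counts would come from your aligner's summary output.

```python
def mapping_rate(mapped_reads: int, total_reads: int) -> float:
    """Return the percentage of reads that mapped."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


def compare_assemblies(total_reads: int, mapped_to_reference: int, mapped_to_self: int) -> dict:
    """Summarize how many reads are lost when using a broad reference assembly."""
    reference_pct = mapping_rate(mapped_to_reference, total_reads)
    self_pct = mapping_rate(mapped_to_self, total_reads)
    return {
        "reference_pct": reference_pct,
        "self_pct": self_pct,
        # Reads that found no home in the broad reference assembly.
        "unmapped_in_reference": total_reads - mapped_to_reference,
        # Difference in percentage points between the two assemblies.
        "gain_pct_points": self_pct - reference_pct,
    }


# Hypothetical counts for one library (not real data).
report = compare_assemblies(
    total_reads=20_000_000,
    mapped_to_reference=11_600_000,
    mapped_to_self=18_400_000,
)
for key, value in report.items():
    print(f"{key}: {value}")
```

A large gap in percentage points is a hint that the broad reference is missing transcripts present in your experimental conditions.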

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Chris Fields ★ 2.2k

0

Hi Chris, thanks for your point. A higher map-back rate sounds normal, and I have seen it in my own work too, but for me the difference in mapping rate between a broad-based reference assembly and a self-generated assembly is usually 10-15%. Still, what you have experienced can certainly happen. Could you please let me know how you deal with the situation where, as you mentioned, a transcriptome assembly is good in terms of annotation but not very informative for gene expression analysis?

ADD REPLY • link 8.4 years ago by seta ★ 1.9k

0

It never hurts to run test alignments to the reference assembly to get an idea how serious a problem it may be, but I always suggest running a de novo assembly + QC/filtering + RSEM as well; you can always map back or cluster to the original transcriptome if needed to get a rough idea, though I would also suggest running Trinotate. With modern versions of assemblers (e.g. Trinity) and digital normalization a typical de novo trx assembly doesn't take very long anymore; the bottleneck is then access to hardware (which in our case isn't an issue).

It really comes down to how much you trust that reference assembly and how well they compare. I have unfortunately run into too many instances where someone suggests using a reference trx assembly from lab X or pub Y, but when we've delved into how the assembly was made we find it problematic in some way (poorly documented methods, older seq technology, poor quality samples, shorter reads, made from SE data, not strand-specific, annotation is old or generated in a hard-to-determine way, filtered in an obscure way, should have used --jaccard_clip, etc). In one case I requested the reference assembly and got back the unigenes from a Trinity assembly (generated via tgicl); when asked they mentioned that was all that was provided, so isoform information was pretty much lost.
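For anyone wanting to script the assembly + RSEM steps mentioned above, here is a minimal sketch that only builds the command lines; it does not run anything. The file names, memory and CPU values, and the helper function names are placeholders, `RF` is an assumption that suits many dUTP-style stranded kits (check your own library prep), and the options should be verified against the Trinity version you have installed. As far as I know, recent Trinity releases apply in silico read normalization by default, so no separate flag is shown for it.

```python
import shlex


def trinity_command(left, right, output_dir="trinity_out", ss_lib_type="RF",
                    max_memory="50G", cpu=8, jaccard_clip=False):
    """Build a Trinity de novo assembly command as an argument list."""
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", output_dir,
    ]
    if ss_lib_type:
        # Strand-specific library orientation; leave empty for unstranded data.
        cmd += ["--SS_lib_type", ss_lib_type]
    if jaccard_clip:
        # Only worth considering for gene-dense genomes with overlapping UTRs.
        cmd.append("--jaccard_clip")
    return cmd


def rsem_command(transcripts, left, right, output_dir="rsem_out", ss_lib_type="RF"):
    """Build the abundance estimation command using Trinity's bundled helper script."""
    cmd = [
        "align_and_estimate_abundance.pl",
        "--transcripts", transcripts,
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--est_method", "RSEM",
        "--aln_method", "bowtie2",
        "--trinity_mode",
        "--prep_reference",
        "--output_dir", output_dir,
    ]
    if ss_lib_type:
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


# Placeholder file names; the assembly FASTA path depends on the Trinity version.
assembly = trinity_command("all_R1.fq", "all_R2.fq")
abundance = rsem_command("Trinity.fasta", "sample1_R1.fq", "sample1_R2.fq")
print(shlex.join(assembly))
print(shlex.join(abundance))
```

Building the commands as lists makes it simple to loop over all samples for the abundance step and hand each list to `subprocess.run` once the options have been checked.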

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Chris Fields ★ 2.2k

Ram · Answer 2 · 2015-11-26

0

8.4 years ago

seta ★ 1.9k

For a good de novo transcriptome assembly, try pooling RNA extracted from various tissues of your organism under different conditions and sequencing the pool at 100PE on one lane, which should give you enough coverage.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by seta ★ 1.9k