Question: 50PE vs 100PE reads to assemble a de novo transcriptome
Asked 4.6 years ago by Alicia (United States):

Dear community,

I am designing a differential gene expression experiment in a non-model animal without a reference genome and I need your expertise.

My experiment has 2 different conditions in 2 different locations, with 3 replicates per condition. It is also a time-course experiment with 4 different time points. Total number of samples = ((2x3)x2)x4 = 48 samples.

My initial idea was to multiplex 12 samples per ILMN HiSeq lane at 50PE.
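
As a quick sanity check on the lane arithmetic, here is a minimal sketch. The 12-samples-per-lane figure comes from the plan above; the function name `lanes_needed` is only illustrative.

```python
import math


def lanes_needed(n_samples: int, samples_per_lane: int) -> int:
    """Number of lanes required, rounding up so no sample is left out."""
    return math.ceil(n_samples / samples_per_lane)


# Design factors as described in the post.
conditions, locations, replicates, time_points = 2, 2, 3, 4
n_samples = conditions * locations * replicates * time_points

print(f"{n_samples} samples -> {lanes_needed(n_samples, 12)} lanes")
# prints "48 samples -> 4 lanes"
```

With 2 replicates instead of 3, the same calculation gives 32 samples and 3 lanes (the last one only partly filled).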

Since I don't have a reference, I have to generate my own de novo transcriptome, and this is where I need your help. I don't know whether 50PE is going to be enough to generate a good reference transcriptome; maybe it is better to go for 100PE in at least 2 lanes to get better coverage.

What do you think? Of course, I have budget restrictions so I cannot sequence my 32 samples at 100PE.

Thanks for your help!

Answer by Chris Fields (University of Illinois Urbana-Champaign), 4.6 years ago:

Not sure the math is right here: 2 (conditions) x 2 (locations) x 4 (time points) x 3 (replicates) = 48 samples, not 32. Did you mean 2 replicates?

If at all possible use strand-specific 100nt PE reads, and generate your assembly off your experimental data. Pretty much everything is strand-specific these days (cost is the same, at least here at our facility) but it doesn't hurt to check and make sure.

Also, maybe I'm misunderstanding what you are trying to do, but just in case: regardless of whether you choose 50 or 100nt PE reads, don't mix sequence lengths, e.g. some samples at 50nt PE and others at 100nt PE. Why complicate an already complex experimental design with another potential confounding factor (variable sequencing length)? I could see issues that hinge on sequence length (alignment, read counting, etc.) causing batch effects if read lengths vary drastically.

Re: @seta's answer, I have found that reference assemblies (ones made from multiple tissues, developmental stages, etc) are very useful for annotation and gene model building, but using them directly for gene expression analysis can be misleading. If you go this route I highly suggest making sure you aren't losing information by tracking read fates, in particular assessing how many reads don't map to the reference assembly (and thus may represent transcripts unique to your experimental conditions). I have personally seen several instances where <60% of reads map to a broad-based reference assembly, but >90% map back to a self-generated assembly, suggesting that information can be lost.
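
One way to track read fates is to compute the fraction of reads that align to each candidate assembly and compare the two numbers. The sketch below is only an illustration, not a replacement for a proper QC tool: it assumes alignments in plain-text SAM format and relies only on the standard FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `mapping_rate` and the toy records are made up for the example.

```python
def mapping_rate(sam_lines):
    """Fraction of primary read records that are mapped.

    Header lines (starting with '@') are skipped, and secondary and
    supplementary alignments are ignored so each read is counted once.
    """
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # not a primary record
        total += 1
        if not flag & 0x4:
            mapped += 1
    return mapped / total if total else 0.0


# Toy records: only the first two SAM columns matter for this calculation.
example = [
    "@HD\tVN:1.6",
    "read1\t0\tcontig1",    # mapped, forward strand
    "read2\t4\t*",          # unmapped
    "read3\t16\tcontig2",   # mapped, reverse strand
    "read3\t256\tcontig3",  # secondary alignment, ignored
]
rate = mapping_rate(example)
print(f"mapping rate: {rate:.2%}")
```

Running this once on alignments to the broad reference assembly and once on alignments to the self-generated assembly gives the kind of comparison described above. For paired-end data each mate is its own record, so the rate is per read rather than per fragment.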


Hi Chris, thanks for your point. A higher map-back rate to a self-generated assembly sounds normal; I have seen it in my own work too, although for me the difference in mapping rate between a broad-based reference assembly and a self-generated assembly is usually 10-15%. What you have experienced can certainly happen as well. Could you please let me know how you deal with the situation where a transcriptome assembly is good in terms of annotation but, as you mentioned, not very informative for gene expression analysis?

Reply written 4.6 years ago by seta

It never hurts to run test alignments to the reference assembly to get an idea how serious a problem it may be, but I always suggest running a de novo assembly + QC/filtering + RSEM as well; you can always map back or cluster to the original transcriptome if needed to get a rough idea, though I would also suggest running Trinotate. With modern versions of assemblers (e.g. Trinity) and digital normalization a typical de novo trx assembly doesn't take very long anymore; the bottleneck is then access to hardware (which in our case isn't an issue).

It really comes down to how much you trust that reference assembly and how well the two compare. I have unfortunately run into too many instances where someone suggests using a reference trx assembly from lab X or pub Y, but when we've delved into how the assembly was made we find it problematic in some way (poorly documented methods, older seq technology, poor quality samples, shorter reads, made from SE data, not strand-specific, annotation is old or generated in a hard-to-determine way, filtered in an obscure way, should have used --jaccard_clip, etc). In one case I requested the reference assembly and got back the unigenes from a Trinity assembly (generated via tgicl); when asked, they mentioned that was all that had been provided, so isoform information was pretty much lost.
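
Since poorly documented methods are one of the problems listed above, one option is to record the exact assembler invocation next to the assembly. The sketch below only builds and prints a Trinity command line; it does not run anything. The file names, CPU count, memory limit, and the `RF` library type are assumptions chosen for illustration (check the correct `--SS_lib_type` for your library prep), and the helper `trinity_command` is hypothetical.

```python
import shlex


def trinity_command(left, right, out_dir="trinity_out", ss_lib_type="RF",
                    cpu=8, max_memory="50G", jaccard_clip=False):
    """Build (but do not run) a Trinity command for strand-specific PE reads."""
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--SS_lib_type", ss_lib_type,
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", out_dir,
    ]
    if jaccard_clip:
        # Optional flag mentioned above; whether it helps depends on the organism.
        cmd.append("--jaccard_clip")
    return cmd


# Hypothetical input files; replace with your own reads.
cmd = trinity_command("all_samples_R1.fq.gz", "all_samples_R2.fq.gz",
                      jaccard_clip=True)
print(shlex.join(cmd))
```

Saving the printed line together with the assembler version makes it much easier for someone else to judge later how the assembly was made.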

Reply written 4.6 years ago by Chris Fields
Answer by seta, 4.6 years ago:

To make a good de novo transcriptome assembly, you should try to pool RNA extracted from various tissues of your organism under different conditions and sequence the pool at 100 PE on one lane, which should give you enough coverage.
