Question: de novo transcriptome assembly
3.7 years ago, 402374688 wrote:

I'm assembling a transcriptome de novo. I have RNA-seq data from treatment and control groups at two time points, with three replicates per group. For the assembly, should I pool all reads (from both control and treatment groups), or assemble each replicate separately? Is it OK to pool them? And if I assemble each replicate separately, what should I do to make the results comparable between groups and time points? Thank you.

rna-seq assembly • 1.6k views
modified 3.7 years ago by Chris Fields • written 3.7 years ago by 402374688

Treat each replicate separately. Differential expression requires replicates. Never pool; it makes me cringe.

written 3.7 years ago by Biogeek380

Yeah, I know the replicates should be kept separate for the differential expression analysis. But for the transcriptome assembly, the treatment, control, and different replicates should all have similar genes or transcripts, right? So can I pool them for the assembly and then map each replicate back for differential expression?

written 3.7 years ago by 402374688

Just to be clear, everything goes into the one assembly. You should have just one assembly. For counts, use your individual read sets, then form a matrix using the Trinity pipeline.
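A minimal Python sketch of the "form a matrix" step, assuming per-replicate counts are already in hand. The replicate names, transcript IDs, and the `build_count_matrix` helper are hypothetical; this is not the Trinity pipeline's own script, only an illustration of the shape of the result (one row per transcript, one column per replicate).

```python
def build_count_matrix(per_replicate_counts):
    """Combine per-replicate counts into a transcripts x replicates matrix.

    per_replicate_counts: dict mapping replicate name -> {transcript_id: count}.
    Returns (replicates, transcripts, rows), where rows[i][j] is the count of
    transcripts[i] in replicates[j]. Missing transcripts are counted as 0.
    """
    # Columns keep the order in which the replicates were given.
    replicates = list(per_replicate_counts)
    # Rows cover every transcript seen in any replicate, sorted for stability.
    transcripts = sorted(
        {tid for counts in per_replicate_counts.values() for tid in counts}
    )
    rows = [
        [per_replicate_counts[rep].get(tid, 0) for rep in replicates]
        for tid in transcripts
    ]
    return replicates, transcripts, rows


if __name__ == "__main__":
    # Hypothetical counts for three replicates mapped to the same assembly.
    counts = {
        "ctrl_rep1": {"tA": 10, "tB": 0},
        "ctrl_rep2": {"tA": 12},
        "trt_rep1": {"tA": 30, "tC": 5},
    }
    replicates, transcripts, rows = build_count_matrix(counts)
    print("\t".join(["transcript"] + replicates))
    for tid, row in zip(transcripts, rows):
        print("\t".join([tid] + [str(c) for c in row]))
```

Because every replicate is counted against the same single assembly, the transcript IDs line up across columns, which is what makes the groups and time points comparable.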

written 3.7 years ago by Biogeek380

Got it. I really appreciate it. I'm new to assemblies, and thanks for your patient explanation.

written 3.7 years ago by 402374688

Not necessarily...

Your treatment and control will presumably differ when compared.

Within replicate groups, you may have one replicate that is an outlier. Once everything is pooled, how do you find the rotten egg in the basket?

You may also have one read set with contamination, etc.

For transcriptome assembly, concatenate your read files in order (keeping the same sample order for both forward and reverse reads). Remember the transcriptome is an assembly of everything, so feeding in one concatenated left-read file and one concatenated right-read file (presuming you have read pairs) made up of all reads is fine. Check out the Trinity GitHub page for some help. It's great if you're new to assemblies, and it's very beginner friendly.
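The concatenation step could be sketched in Python as below. The sample names, the `_1.fq`/`_2.fq` file naming, and both helper functions are assumptions for illustration; the one point that matters is that the left and right files are built from the same sample list in the same order, so read pairs stay in sync.

```python
import shutil
from pathlib import Path


def concatenate_in_order(sample_names, suffix, in_dir, out_path):
    """Append one file per sample to out_path, strictly in sample_names order."""
    with open(out_path, "wb") as out:
        for name in sample_names:
            with open(Path(in_dir) / f"{name}{suffix}", "rb") as src:
                shutil.copyfileobj(src, out)


def pool_paired_reads(sample_names, in_dir, out_dir):
    """Build one pooled left file and one pooled right file for the assembly.

    Assumes (hypothetically) that each sample has <name>_1.fq and <name>_2.fq
    in in_dir. The same sample order is used for both mates.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    left = out_dir / "all_left.fq"
    right = out_dir / "all_right.fq"
    concatenate_in_order(sample_names, "_1.fq", in_dir, left)
    concatenate_in_order(sample_names, "_2.fq", in_dir, right)
    return left, right
```

The pooled files are only for the assembly step; keep the original per-replicate files, since those are what get mapped back for counting.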

For abundance counting in RSEM, for example, you will provide each set of reads as an individual replicate. Pooling in RNA-Seq is a bit cringe-worthy and defeats the purpose. Having an idea of variability is also key. You will also limit your downstream analysis if you pool, since you would be stuck with one replicate. Statistics works off replicates.
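A toy illustration of why pooled counts hide variability, using made-up numbers for a single transcript (the counts and the `summarize` helper are hypothetical). Both groups have the same pooled total, but only the per-replicate view shows that one group contains an outlier.

```python
from statistics import mean, stdev


def summarize(replicate_counts):
    """Return the pooled total plus the per-replicate mean and sample SD."""
    return {
        "pooled": sum(replicate_counts),
        "mean": mean(replicate_counts),
        "sd": stdev(replicate_counts),
    }


if __name__ == "__main__":
    # Hypothetical counts for one transcript, three replicates per group.
    consistent = [100, 110, 90]
    with_outlier = [200, 50, 50]
    for label, counts in [("consistent", consistent), ("with outlier", with_outlier)]:
        print(label, summarize(counts))
```

With the counts pooled, the two groups are indistinguishable; with replicates kept apart, the spread is visible and a differential expression test has something to estimate variance from.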

written 3.7 years ago by Biogeek380

3.7 years ago, Chris Fields (University of Illinois Urbana-Champaign) wrote:

The approach suggested by the Trinity developers is to combine all sample reads for the assembly step (possibly using digital normalization to speed up the assembly), then realign each sample's reads to the assembly for filtering and DE analysis (the latter step using salmon, RSEM, or alternative tools). This is explicitly stated in the notes for this workflow.
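One way to keep track of which reads belong to which replicate is a tab-delimited samples table. Trinity's documentation describes such a file for its `--samples_file` option; the column order used below (condition, replicate, left reads, right reads) is my understanding of that layout and should be checked against the documentation for your Trinity version. The group names, time-point labels, and file paths are hypothetical.

```python
import csv
import io
from itertools import product


def samples_rows(groups, timepoints, n_reps, read_dir="reads"):
    """List (condition, replicate, left_fq, right_fq) for every replicate."""
    rows = []
    for group, timepoint in product(groups, timepoints):
        condition = f"{group}_{timepoint}"
        for rep in range(1, n_reps + 1):
            replicate = f"{condition}_rep{rep}"
            rows.append(
                (
                    condition,
                    replicate,
                    f"{read_dir}/{replicate}_1.fq",  # hypothetical file naming
                    f"{read_dir}/{replicate}_2.fq",
                )
            )
    return rows


def to_tsv(rows):
    """Render the rows as tab-delimited text, one replicate per line."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # The design from the question: 2 groups x 2 time points x 3 replicates.
    print(to_tsv(samples_rows(["control", "treatment"], ["t1", "t2"], 3)), end="")
```

For the design in the question this gives 12 rows, one per replicate, all feeding a single assembly and then each quantified separately.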

In addition, I recommend following up the assembly with Transrate to assess assembly quality and filter out low-quality assembly artifacts, and then annotating the transcriptome (I'm biased towards tools like Trinotate, though others prefer commercial tools like BLAST2GO); this helps identify additional elements like rRNA that you can disregard. You can also use this step to screen for contaminants, if that is a potential issue in your assembly, as BLAST is a typical step for annotation purposes.

modified 3.7 years ago • written 3.7 years ago by Chris Fields

Thank you. This is really useful for me. I'm a beginner in this area. When using tools or software, I can understand the general options that are used in most cases, but when it comes to more specialized options, such as the choice of k-mer length, I just get lost. It's frustrating to be puzzled by this kind of stuff.

written 3.7 years ago by 402374688

I can relate. Learning is a long process, especially if you're new to bioinformatics. I've been doing it for 2 years now and still learn every day. Stick to Trinity and its fixed k-mer length of 25. SOAPdenovo-Trans, Velvet, etc. are more complex to use.
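For anyone unsure what the k-mer length actually controls, here is a small Python sketch of how a read is broken into overlapping k-mers, which is the unit a de Bruijn graph assembler works with. The `kmers` function and the toy sequences are illustrative only, not code from any assembler.

```python
def kmers(seq, k=25):
    """Return every overlapping substring of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    # Tiny example with k=3 so the output is readable.
    print(kmers("ACGTA", k=3))
    # A read of length L yields L - k + 1 k-mers (none if L < k).
    read = "ACGT" * 25  # a 100-base toy read
    print(len(kmers(read, k=25)))
```

A smaller k gives more k-mers per read and more overlaps between reads, at the cost of more ambiguity in repeats; a larger k is more specific but needs higher coverage. With a fixed k of 25 that choice is already made for you.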

written 3.7 years ago by Biogeek380

Yeah, it's indeed a long process, especially since my supervisor isn't in bioinformatics. It's most noticeable now that I've just started: it's easy to waste a whole day without knowing what to do. As a senior, can you give me some advice on how to get started in this area (basically, we are studying the genome of an insect and doing some transcriptome analyses)?

written 3.7 years ago by 402374688


I'm a third-year PhD student and I had the same problem ;-) Not quite a senior yet, haha. I think the main things that will help are:

Focus and read up on your approaches. Find an approach and read papers on how other authors implement it. It is not a race; it pays off to do your research before running ahead and applying methods. Use the forums (here and SEQanswers are good forums, with plenty of helpful peeps to help and advise you).

OK, you have a genome. Do you know how complete it is? If your genome is quite good/complete, I would do a reference-guided assembly; it's much easier. You could use the STAR aligner to build a genome index and align your reads to the genome, producing BAM files, then do counts using something like Cufflinks.

If it's not, go the de novo way. It's messier, but may yield a more complete analysis if your genome is fragmented and not very well annotated.

written 3.7 years ago by Biogeek380

Keep in mind that Trinity now also performs reference-guided assemblies

written 3.7 years ago by Chris Fields

I assembled several transcriptomes of the same organism de novo and found that as the number of reads (samples) increases, the resulting assembly gets larger and larger. To my knowledge, this means it contains redundant transcripts, right? What I want to ask is how to remove these redundant transcripts (if we can call them that). One more concern: when removing redundancy, is it possible to lose some genes of the same family, or to disturb the subsequent quantification within a family? As far as I know, the following steps may help: when assembling, use --normalize_reads to limit the maximum read coverage; after the Trinity assembly, use TGICL to extend the transcripts and CD-HIT to remove highly similar sequences. Are there other effective tools or strategies that can help with this?

modified 3.7 years ago • written 3.7 years ago by 402374688


Yes, EvidentialGene's tr2aacds may be a good option for you. The tool is becoming increasingly popular. I used to use CD-HIT and CAP3, although I think clustering can remove important genes. CD-HIT and CAP3 do, however, seem to be used quite a lot. It's up to you which method you want to use.
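To make "redundant transcripts" concrete, here is a deliberately strict toy filter in Python: it drops a transcript only if it is an exact substring of a longer kept transcript on either strand. This is not the algorithm used by CD-HIT, CAP3, or tr2aacds; the function names and sequences are hypothetical, and a real assembly would need something far more scalable.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_contained(transcripts):
    """Remove transcripts exactly contained in a longer (or identical) kept one.

    transcripts: dict mapping transcript ID -> sequence.
    Sequences are visited longest first (ties broken by ID), so the longest
    representative of each contained set is the one that survives.
    """
    kept = {}
    ordered = sorted(transcripts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for tid, seq in ordered:
        rc = revcomp(seq)
        if any(seq in longer or rc in longer for longer in kept.values()):
            continue  # exact fragment or duplicate of something already kept
        kept[tid] = seq
    return kept


if __name__ == "__main__":
    toy = {
        "t1": "ACGTACGTAA",
        "t2": "CGTACG",  # fragment of t1
        "t3": "TTACGT",  # reverse complement of a fragment of t1
        "t4": "GGGGCC",  # unrelated
    }
    print(sorted(drop_contained(toy)))
```

Because it requires exact containment, this filter would keep two family members that differ by even one base. Clustering at an identity threshold below 100% is where close paralogs can start to be merged, which is the risk raised above.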

written 3.7 years ago by Biogeek380

Yes, definite +1 for evidentialgenes/tr2aacds.

written 3.7 years ago by Chris Fields

I have downloaded EG and used the tr2aacds script. One question comes to mind: have I perhaps missed some configuration?

For a beginner, is the straightforward use just to run this script and take the .okay subset (.tr, .aa, or .cds, depending on the downstream analysis)? I've read the .doc files and I can't find a "configuration process"; it looks almost too easy to use to be right.

written 2.3 years ago by pablo6199170