Question: de novo RNA-seq and different assembly options
6.5 years ago by
United States
samantha_jeschonek50 wrote:

Hi all -- I'm new to RNA-seq and have had some issues assembling the reads. I'm looking for any advice or input on what might be the best way to handle my data.

My work is done in oocytes of a non-model organism without a reference genome.  

I have performed RNA-IPs against two different proteins and processed the IPs for paired-end RNA-seq. My goal is to identify the transcripts associated with both of these proteins (the overlap). In addition, I have processed whole oocytes for RNA-seq. Everything was done with three biological replicates.
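
A minimal sketch of how the overlap between the two IPs could be computed once per-replicate abundances are available. It assumes each replicate is already loaded as a mapping from transcript ID to TPM; the function names, the TPM cutoff of 1.0, and the "at least 2 of 3 replicates" rule are illustrative assumptions, not values from this thread. Detection alone does not show enrichment, which would need a comparison against the whole-oocyte samples.

```python
from typing import Dict, List, Set


def detected(replicates: List[Dict[str, float]],
             min_tpm: float = 1.0,
             min_reps: int = 2) -> Set[str]:
    """Return transcripts with TPM >= min_tpm in at least min_reps replicates.

    min_tpm and min_reps are hypothetical defaults chosen for illustration.
    """
    hits: Dict[str, int] = {}
    for rep in replicates:
        for transcript_id, tpm in rep.items():
            if tpm >= min_tpm:
                hits[transcript_id] = hits.get(transcript_id, 0) + 1
    return {tid for tid, n in hits.items() if n >= min_reps}


def shared_transcripts(ip_a: List[Dict[str, float]],
                       ip_b: List[Dict[str, float]],
                       min_tpm: float = 1.0,
                       min_reps: int = 2) -> Set[str]:
    """Transcripts detected in both IPs under the same detection rule."""
    return detected(ip_a, min_tpm, min_reps) & detected(ip_b, min_tpm, min_reps)


# Toy data: three replicates per IP, transcript IDs are made up.
ip_a = [{"t1": 5.0, "t2": 0.0, "t3": 2.0},
        {"t1": 4.0, "t2": 0.5, "t3": 0.0},
        {"t1": 6.0, "t2": 0.0, "t3": 3.0}]
ip_b = [{"t1": 1.5, "t3": 0.2},
        {"t1": 2.0, "t3": 0.0},
        {"t1": 0.0, "t3": 4.0}]
overlap = shared_transcripts(ip_a, ip_b)
```

Requiring detection in more than one replicate is one simple way to keep a single noisy replicate from defining the overlap.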

Since my work is done in a non-model organism, I have been using Trinity to assemble my paired-end RNA-seq data. I think there are a few ways this can be done, however, and I could use any input on which method might be best. I've dabbled with one and had some errors, which is why I'm confused and wondering if a different approach would be better.

1. OPTION 1:
Assemble the whole-oocyte transcriptome using Trinity and use this assembly as the reference (in place of a genome).
  • After assembly, I used Trinotate to cross-reference my assembly against a recently released protein database for my organism. I believe this assigned protein annotations to the contigs.
  • I then used the scripts bundled with Trinity to align reads and estimate transcript abundance with RSEM for each IP sample separately.
  • I simply used the raw FASTQ files (left and right) for each IP. (Did they need to be assembled first?)
  • In the RSEM.isoforms.results output for every IP, a control transcript had an FPKM of 0, which I assume means it was not detected. This is obviously concerning, especially since I could detect my control transcript by qPCR in the same sample.

2. OPTION 2:
Assemble all IP reads together. In this case I would then map each IP's raw FASTQ files back to this "IP-transcriptome" to try to estimate transcript abundance using RSEM. (I'd ignore the whole-oocyte data in this scenario.)

3. OPTION 3:
Assemble each IP individually. I would then use TransDecoder, Trinotate, and BLAST to try to map the assembled sequences to the recently released protein database, using the protein database FASTA file as the reference.
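
For the pooled assembly described in Option 2, pooling comes down to concatenating the left files and the right files in the same sample order, so that read pairs stay in step. A minimal sketch, assuming all files share the same compression (all plain or all gzipped) and each ends with a newline; the function name and the file names in the usage comment are hypothetical.

```python
import shutil
from typing import List, Tuple


def pool_paired_reads(samples: List[Tuple[str, str]],
                      out_left: str,
                      out_right: str) -> None:
    """Concatenate (left, right) FASTQ pairs into one left and one right file.

    Both outputs are written in the same sample order, so the i-th read of
    the pooled left file still pairs with the i-th read of the pooled right file.
    """
    with open(out_left, "wb") as left_out, open(out_right, "wb") as right_out:
        for left_path, right_path in samples:
            with open(left_path, "rb") as fh:
                shutil.copyfileobj(fh, left_out)
            with open(right_path, "rb") as fh:
                shutil.copyfileobj(fh, right_out)


# Hypothetical usage:
# pool_paired_reads(
#     [("ipA_rep1_1.fq", "ipA_rep1_2.fq"), ("ipB_rep1_1.fq", "ipB_rep1_2.fq")],
#     "pooled_1.fq",
#     "pooled_2.fq",
# )
```

The bytes are copied unchanged, so any read trimming or filtering would have to happen before this step.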

Which option seems best? Any idea why my first approach failed to show the control transcript?
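
One way to narrow down the missing control transcript is to check whether its contig is in the results table at all. RSEM lists every reference transcript in the isoforms results file, including those with zero counts, so an ID that is absent was never in the assembly, while an ID with an expected count of 0 was assembled but received no reads. A minimal sketch, assuming the standard tab-separated RSEM columns (`transcript_id`, `length`, `expected_count`, `TPM`, `FPKM`); the function name is hypothetical, and the contig ID(s) for the control would have to be found first, for example by BLAST against the assembly.

```python
import csv
from typing import Dict, Iterable


def lookup_transcripts(results_path: str,
                       ids: Iterable[str]) -> Dict[str, Dict[str, float]]:
    """Return length, expected_count, TPM and FPKM for the requested IDs.

    IDs that do not appear in the file are simply missing from the result,
    which means they are not part of the reference used for quantification.
    """
    wanted = set(ids)
    found: Dict[str, Dict[str, float]] = {}
    with open(results_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row["transcript_id"] in wanted:
                found[row["transcript_id"]] = {
                    "length": float(row["length"]),
                    "expected_count": float(row["expected_count"]),
                    "TPM": float(row["TPM"]),
                    "FPKM": float(row["FPKM"]),
                }
    return found
```

Running this on each IP's results file with the same IDs gives a quick side-by-side view of whether the control is absent from the reference or present with no reads assigned.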

This is my first time doing RNA-seq so I apologize if these questions are very naive! All advice is greatly appreciated. Thank you!

6.5 years ago by
Charles Warden8.0k
Duarte, CA
Charles Warden8.0k wrote:

I would probably pool the reads prior to de novo assembly (sounds like option #2?) and I would probably use CLC de novo or Oases over Trinity, but this is a common question with several possible answers. I've collected my own suggestions in the following post, if you want to get a better idea about possible options:


Thank you for your advice!

Reply written 6.5 years ago by samantha_jeschonek

Hello Charles, I want to detect lncRNA from some human (control and treatment) RNA-seq data in FASTQ format. I read the article of , which uses CLC Genomics and a de novo assembly pathway and..., and I checked its data. It is different from mine (in terms of library and format). Can I still use its workflow to detect lncRNA?

I am using CLC Genomics for differential gene expression.

Your attention would be really appreciated.

Reply modified 4.5 years ago • written 4.5 years ago by Edalat

If you are interested in human lncRNA, you might want to start with the GENCODE annotations without doing a de novo assembly:

Pre-existing assemblies might also exist for your specific topic of interest, but you can also align your reads against the assembly that you have made and BLAT the sequences for highly expressed transcripts to the human genome (to see if they overlap known annotations). The quantification part should be something that you can do in CLC Bio. I can't provide more specific directions, but it is commercial software with its own tech support (

Not sure how you were comparing your human results to that Salmon paper (and CLC Bio should work with most sequencing platforms), but I would focus on the most highly expressed transcripts. If you do a transcriptome alignment to begin with, you can just focus on unaligned reads to try and see if there are any highly expressed novel lincRNAs.

Reply modified 4.5 years ago • written 4.5 years ago by Charles Warden

Also, this should really be a separate question, not a comment on this earlier post, which is similar only in that it involves a de novo assembly (in this case, in a non-model organism).

Reply written 4.5 years ago by Charles Warden
6.5 years ago by
smjazayeri20 wrote:

Do not worry too much about which program you use to analyze your data. In my experience, different programs, and de novo assembly versus mapping to a reference, all give very similar results. I did this for a plant without a reference using SOAPdenovo2, then again with a reference using Bowtie2, and also with a mix of mapping and assembly using Bowtie2 and SOAPdenovo-Trans. The results were similar. For DEGs I used DESeq, and the transcript abundance results were almost the same. You might have to modify some options, but we generally worked with the programs' default options and the results were satisfactory. Hope this is useful!
