Question

de novo RNA-seq and different assembly options

1

9.7 years ago

samantha_jeschonek ▴ 50

Hi all -- I'm new to RNA-seq and have had some issues assembling the reads. I'm looking for any advice or input on what might be the best way to handle my data.

My work is done in oocytes of a non-model organism without a reference genome.

I have performed an RNA-IP against each of two different proteins and processed the IPs for paired-end RNA-seq; my goal is to identify the transcripts associated with both of these proteins (the overlap). In addition, I have processed whole oocytes for RNA-seq. Everything was done with three biological replicates.

Since my work is done in a non-model organism, I have been using Trinity to assemble my paired-end RNA-seq data. I think there are a few ways this can be done, however, and I could use any input on what the best method might be. I've dabbled with one and had some errors, which is why I'm confused and wondering if a different approach is better.

OPTION 1: Assemble the whole oocyte transcriptome using Trinity and use this as the reference (in place of a genome).
- After assembly, I used Trinotate to cross-reference my assembly to a recently released protein database for my organism. I believe this assigned a protein annotation to the contigs.
- I then used the built-in Trinity plugins to align and estimate transcript abundance using RSEM for each IP sample separately.
- I simply used the raw fastq files (left and right) for each IP (did they need to be assembled here??).
- Looking at the RSEM.isoforms.results output, I saw in every IP that a control transcript had an FPKM of 0, which I'm assuming means it is not expressed. This is obviously concerning...especially since, using the same sample, I could identify my control transcript by qPCR.
OPTION 2: Assemble all IP reads together. In this case I would then map each IP's raw fastq file back to this "IP-transcriptome" to try and estimate transcript abundance using RSEM. (I'd ignore the whole oocyte data in this scenario.)
OPTION 3: Assemble each IP individually and separately. I would then use TransDecoder, Trinotate, and BLAST to try and map these reads to the recently released protein database. I would use the protein database.fasta file as the reference in this case.

Which option seems best? Any idea why my first approach failed to show the control transcript?
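One thing worth ruling out for the control transcript is whether it has a row in the RSEM table at all. RSEM only reports transcripts that are in the reference it was given, so "no row" and "row with zero reads" point to different problems. Below is a minimal sketch of that check. It assumes the standard RSEM isoforms.results column names; the contig IDs and all numbers in the toy table are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def read_rsem_isoforms(handle):
    """Parse an RSEM *.isoforms.results table into {transcript_id: row}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["transcript_id"]: row for row in reader}


def describe_transcript(table, transcript_id):
    """Say whether a transcript is missing from the table or present with zero reads."""
    row = table.get(transcript_id)
    if row is None:
        # Not in the reference RSEM was run against, so it can never get reads.
        return "absent from reference"
    count = float(row["expected_count"])
    fpkm = float(row["FPKM"])
    if count == 0:
        return "in reference, zero reads assigned"
    return f"in reference, expected_count={count:g}, FPKM={fpkm:g}"


# Toy table with invented IDs and values, only to show the file layout.
TOY = (
    "transcript_id\tgene_id\tlength\teffective_length\t"
    "expected_count\tTPM\tFPKM\tIsoPct\n"
    "contig_a_i1\tcontig_a\t1200\t1050.00\t0.00\t0.00\t0.00\t0.00\n"
    "contig_b_i1\tcontig_b\t900\t750.00\t42.00\t15.30\t12.10\t100.00\n"
)

if __name__ == "__main__":
    table = read_rsem_isoforms(io.StringIO(TOY))
    for tid in ("contig_a_i1", "contig_b_i1", "contig_c_i1"):
        print(tid, "->", describe_transcript(table, tid))
```

With a real file, the handle would come from `open(path, newline="")` instead of `io.StringIO`. The control's ID in the assembly has to be found first, for example by BLASTing the known control sequence against the assembled contigs.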

This is my first time doing RNA-seq so I apologize if these questions are very naive! All advice is greatly appreciated. Thank you!

blast RNA-Seq Assembly • 4.1k views

updated 2.4 years ago by Ram 43k • written 9.7 years ago by samantha_jeschonek ▴ 50

Ram · Answer 1 · 2014-08-07

2

9.7 years ago

Charles Warden 8.2k

I would probably pool the reads prior to de novo assembly (sounds like option #2?) and I would probably use CLC de novo or Oases over Trinity, but this is a common question with several possible answers. I've collected my own suggestions in this post, if you want to get a better idea about possible options.
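For what pooling can look like in practice, here is a minimal sketch, not a prescribed workflow. It concatenates the left-read files and the right-read files from several samples in the same sample order, so that mates stay in sync between the two pooled files. The function names and file names are hypothetical, and it assumes every input file ends with a newline and that all inputs are of the same kind (all plain text or all gzip).

```python
import shutil
import tempfile
from pathlib import Path


def pool_fastq(inputs, output):
    """Concatenate FASTQ files byte-for-byte, in the order given."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(output)


def pool_pairs(samples, out_left, out_right):
    """Pool paired-end samples given as a list of (left_path, right_path).

    Both outputs are built in the same sample order so read N in the pooled
    left file still pairs with read N in the pooled right file.
    """
    lefts = [left for left, _ in samples]
    rights = [right for _, right in samples]
    return pool_fastq(lefts, out_left), pool_fastq(rights, out_right)


if __name__ == "__main__":
    # Toy reads written to a temporary directory, only to demonstrate the call.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "ip1_1.fq").write_text("@r1/1\nACGT\n+\nIIII\n")
        (tmp / "ip1_2.fq").write_text("@r1/2\nTTGA\n+\nIIII\n")
        (tmp / "ip2_1.fq").write_text("@r2/1\nGGCA\n+\nIIII\n")
        (tmp / "ip2_2.fq").write_text("@r2/2\nCCAT\n+\nIIII\n")
        left, right = pool_pairs(
            [(tmp / "ip1_1.fq", tmp / "ip1_2.fq"),
             (tmp / "ip2_1.fq", tmp / "ip2_2.fq")],
            tmp / "pooled_1.fq",
            tmp / "pooled_2.fq",
        )
        print(left.read_text(), right.read_text(), sep="")
```

Trinity can also take comma-separated lists of files for its left and right inputs, which avoids writing pooled copies to disk.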

updated 2.4 years ago by Ram 43k • written 9.7 years ago by Charles Warden 8.2k

0

Thank you for your advice!

9.7 years ago by samantha_jeschonek ▴ 50

0

Hello Charles, I want to detect lncRNAs in some human (control and treatment) RNA-seq data in FASTQ format. I read the article at http://www.nature.com/articles/srep22698, which uses CLC Genomics and a de novo assembly pathway and... I checked its data, and it is different from mine (in terms of library and format). Can I still use its workflow to detect lncRNAs?

I am using CLC Genomics to get differential gene expression.

Your attention would be really appreciated.

7.7 years ago by Edalat ▴ 30

1

If you are interested in human lncRNA, you might want to start with the GENCODE annotations without doing a de novo assembly:

http://www.gencodegenes.org/releases/current.html

Pre-existing assemblies might also exist for your specific topic of interest, but you can also align your reads against the assembly that you have made and BLAT the sequences for highly expressed transcripts to the human genome (to see if they overlap known annotations). The quantification part should be something that you can do in CLC Bio. I can't provide more specific directions, but it is commercial software with its own tech support (support-clcbio@qiagen.com).

Not sure how you were comparing your human results to that Salmon paper (and CLC Bio should work with most sequencing platforms), but I would focus on the most highly expressed transcripts. If you do a transcriptome alignment to begin with, you can just focus on unaligned reads to try and see if there are any highly expressed novel lincRNAs.
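To illustrate the "focus on unaligned reads" step, here is a minimal sketch that pulls unmapped reads out of SAM-format alignment lines and writes them back out as FASTQ records. It relies only on the SAM convention that flag bit 0x4 marks an unmapped read; the function names and the toy alignment lines are made up. In practice a dedicated tool such as samtools would normally do this filtering.

```python
def unmapped_records(sam_lines):
    """Yield (name, sequence, quality) for reads whose SAM flag has bit 0x4 set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:  # 0x4 = this read is unmapped
            yield fields[0], fields[9], fields[10]


def to_fastq(records):
    """Format (name, sequence, quality) tuples as FASTQ text."""
    return "".join(f"@{name}\n{seq}\n+\n{qual}\n" for name, seq, qual in records)


# Toy SAM content with invented reads: read_1 is a mapped pair (flags 99/147),
# read_2 is a pair with both mates unmapped (flags 77/141).
TOY_SAM = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "@SQ\tSN:tx1\tLN:1000\n",
    "read_1\t99\ttx1\t10\t42\t4M\t=\t60\t54\tACGT\tIIII\n",
    "read_1\t147\ttx1\t60\t42\t4M\t=\t10\t-54\tTTGA\tIIII\n",
    "read_2\t77\t*\t0\t0\t*\t*\t0\t0\tGGCA\tFFFF\n",
    "read_2\t141\t*\t0\t0\t*\t*\t0\t0\tCCAT\tFFFF\n",
]

if __name__ == "__main__":
    print(to_fastq(unmapped_records(TOY_SAM)), end="")
```

The resulting FASTQ could then go into a de novo assembly restricted to reads that the known annotation does not explain. For paired-end data the two mates would need to be split into separate files (flag bits 0x40 and 0x80 mark the first and second mate), which this sketch leaves out.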

7.7 years ago by Charles Warden 8.2k

1

Also, this should really be a separate question, and not a comment on this previous post, which is only similar in that it involves a de novo assembly (in this case, in a non-model organism).

7.7 years ago by Charles Warden 8.2k

score 2 · Answer 2 · 2014-08-07

Do not worry too much about which program you use to analyze your data. In my experience, different approaches (de novo assembly vs. mapping to a reference) all give very similar results. I did this for a plant without a reference using SOAPdenovo2, then again with a reference using Bowtie2, and also with a mix of mapping and assembly using Bowtie2 and SOAPdenovo-Trans. The results were similar. For DEGs, I used DESeq, and the results were almost the same for transcript abundance. You might have to modify some options, but we generally worked with the standard options of each program and the results were satisfactory. Hope this is useful!
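If you want to put a number on "similar" when comparing DEG lists from two pipelines, one simple option is the overlap between the two sets of gene IDs. This is a minimal sketch, assuming both pipelines report the same kind of identifier (for example after annotating contigs against the same protein database); the function name and gene IDs are made up.

```python
def overlap_summary(genes_a, genes_b):
    """Summarize the overlap between two collections of gene IDs."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared / union; defined here as 0.0 when both are empty.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    # Invented gene IDs, only to demonstrate the call.
    de_novo = ["geneA", "geneB", "geneC"]
    reference_based = ["geneB", "geneC", "geneD"]
    print(overlap_summary(de_novo, reference_based))
```

The same set intersection also works for the original question's goal of finding transcripts shared between two IPs, once each IP has a list of transcripts considered detected.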