Question: Creating Reference De Novo Assembly Using Velvet/Oases
0
gravatar for nbvasani
6.1 years ago by
nbvasani230
United States
nbvasani230 wrote:

Hi Fellows,

I am planning to generate reference de novo assembly as follow:

1] Creating reference assembly by pooling reads of all 6-samples i.e. WT and transgenic together in one fastq file.

2] Run velveth with auto parameter for K from 27 to 71 on that single fastq file

3] Afterword using velvethg for each value of K

4] Running oases on contig.fa file generated from each value of K

5] Transcript.fa file generated from each assembly are merged using Merged option in Oases

6] Checking merged assembly, by mapping each sample reads back to merged assembly using bowtie.

Am I on right path? Any suggestion would be appreciated.

Thanks,

bioinformatics rna-seq • 7.6k views
ADD COMMENTlink modified 6.1 years ago by Charles Warden7.3k • written 6.1 years ago by nbvasani230
1
gravatar for Charles Warden
6.1 years ago by
Charles Warden7.3k
Duarte, CA
Charles Warden7.3k wrote:

You might have seen this on your Trinity-related thread, but here is my suggestion:

I would recommend using the Oases_pipeline.py script.

This allows you to run all of the steps in a single command. In fact, you can use it to run Oases / Velvet multiple times and combing the results (so you don't have to worry about finding the absolute best parameters to use).

For example, the following command will run Velvet / Oases with kmers set to 21 and 23:

python oases_pipeline.py -m 21 -M 23 data/reads.fa

http://www.ebi.ac.uk/~zerbino/oases/OasesManual.pdf

Also, you might find my suggestions from this thread to be helpful:

"ID" in Oases pipeline stats.txt file

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Charles Warden7.3k

I read that comment in my post but I am confused about how to create de novo reference assembly. Thanks a lot

ADD REPLYlink written 6.1 years ago by nbvasani230

If you are talking about Trinity or Oases, you are talking about RNA-Seq, correct?

The reference here is not a genomic reference, but a transcript reference.

The output of Oases includes a transcripts.fa file. The extra link that I added to this response describes some strategies I used to create an optimal list of transcripts (parameters to make the transcripts.fa file smaller as well as select one sequence per gene). However, this might not be technically necessary.

eXpress aligns reads to a list of mRNAs (unlike aligners such as TopHat, which align to the DNA-based reference genome, and then require a program like cufflinks to provide mRNA quantification). If you are working with RNA-Seq data, this is the only option that you have (since you ideally shouldn't have any sequence from introns or intergenic sequence).

So, once you run the Oases command shown above, you can start referring to the eXpress manual. Also, if you are familar with Perl, I have a templete for this sort of pipeline run_RNA_Seq_de_novo.pl):

https://sites.google.com/site/cwarden45/scripts

ADD REPLYlink written 6.1 years ago by Charles Warden7.3k

I understand for my rna-seq data, the reference will be transcript reference. Any suggestion about RSEM vs eXpress. I will try to understnd and run script.

Thanks for all your feedback. Naresh

ADD REPLYlink written 6.1 years ago by nbvasani230

I haven't personally tried RSEM with de novo RNA-Seq data, but I know that is a popular algorithm. If you have time, you can just try them both.

In the script description, I also mention CLC Bio de novo. Although not specifically designed for RNA-Seq, I found it to be useful. In fact, it provides coverage statistics along with the contigs (so, a separate mRNA quantification step is not needed). If your institution has a license for CLC Bio Genomics Workbench, I would recommend trying that. Otherwise, Oases + eXpress is the only other thing that I can currently stand behind.

ADD REPLYlink written 6.1 years ago by Charles Warden7.3k

Thanks! Can you please give me any suggestion about -amos_file option while running velvet/Oases. Why it is useful and it's importance.

Best, Naresh

ADD REPLYlink written 6.1 years ago by nbvasani230

I don't think you should have to use that parameter.

I believe amos is a file format, so it is only necessary if you want to work with that file format:

http://wiki.bioinformatics.ucdavis.edu/index.php/Illumina_assembly_using_velvet

I think the only necessary parameters are listed in the "impatient people" section of the manual, which I think is fairly well documented (and there are lots of velvet tutorials on the web)

ADD REPLYlink written 6.1 years ago by Charles Warden7.3k

Thanks cwarden45.

ADD REPLYlink written 6.1 years ago by nbvasani230

Hi Cwarden45, I need your help. I used express for abundancy. And for downstream analysis I want to use edgeR. But I am having hard time understading about how to run edgeR package. Can you please guide me step by step. I would really appreciate that.

Thanks, Naresh.

ADD REPLYlink written 6.1 years ago by nbvasani230

Hi Cwarden45,

I need your help. I used express for abundancy. And for downstream analysis I want to use edgeR. But I am having hard time understading about how to run edgeR package. Can you please guide me step by step. I would really appreciate that.

Thanks, Naresh

ADD REPLYlink written 6.1 years ago by nbvasani230

How many samples do you have?

edgeR is meant for differenital expression - meaning you compare the expression patterns for the same gene in multiple samples divided into groups (most often, two groups).

The input for edgeR will be a table of read counts. Once you can produce this, I think the edgeR instructions are pretty good.

The potential problem is that this is typically done against a known reference genome. In other words, eXpress will tell you the abudances for transcripts in a particular sample, but it will not be sufficient to tell you what transcript IDs match between samples.

Using BLAST can help by identifying homologs to other, characterized genomes. For me, this was the end step: I sorted the transcripts by expression levels and checked if the type of genes made sense (for samples coming from muscle, fat, etc.). I have not yet had to define differentially expressed genes from a RNA-Seq de novo assembly study. You might be able to use your BLAST results to guess which transcripts match (or simply use BLAST to map the transcripts between samples instead of the NCBI nucleotide database). However, I don't have any advice in regards to how you can determine the optimal criteria to determine that transcripts from two samples are indeed the same (for example, I think it is likely that the regions of coveraged sequence will often vary - so, only portions of transcripts actually align).

ADD REPLYlink written 6.1 years ago by Charles Warden7.3k

Hi Cwarden45,

Thanks for your feedback.

I have 3 different samples. From express, which count did you used i.e. eff_count, est_count, etc for edgeR.

I agree with you about blast. I am also running blast2go on side to get sequence annotation.

Thanks in advance! Naresh

ADD REPLYlink written 6.1 years ago by nbvasani230

There are a few caveats (see below), but I would probably say tot_count is what you would use.

1) If you only have 3 samples, then you must have at least one group without a replicate. Although you might technically be able run edgeR and/or DESeq without replicates, I don't think the results will be as useful. You may just want to calculate fold-change values, without worry about p-values. If you do this, you should use FPKM values, or you could just rank the genes and compare differences in percentiles.

2) I never used edgeR - my goal was to determine what types of genes were expressed in a couple samples. So, I just wanted to see which genes were most highly expressed (using FPKM and eff_count). The relative expression in one sample to another sample didn't matter.

3) I technically didn't find this result to be the best: I actually found the de novo assembly program in CLC Bio (not specifically designed for RNA-Seq) to give the most reasonable top ~20 genes (using the coverage statistic provided by that program). However, I did think this was the best result out of the open-source tools that I tested. So, you may need to do some trial and error on your own. For example, using the Oases (i.e Velvet) contigs may actually work better than the Oases transcripts (even if you filter them). I would just make sure you understand what each eXpress metric means:

http://bio.math.berkeley.edu/eXpress/manual.html#expr

Good luck!

ADD REPLYlink written 6.1 years ago by Charles Warden7.3k

Thanks a lot cwarden45,

ADD REPLYlink written 6.1 years ago by nbvasani230
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2639 users visited in the last hour