Question: De novo transcriptome assembly of low amount of Illumina reads with MIRA
3.2 years ago, Andrés Ribone wrote:


I'm working with paired-end Illumina RNA-seq data from two varieties of a non-model plant (no sequence is available for it). The aim of the work is to find SSRs and SNPs.

The dataset is small: I have only 6 million reads (120 bp) per strain, with one individual sequenced per strain. I started by assembling the data with Trinity, but the resulting assemblies weren't very good. (I ran a separate assembly for each strain; I didn't mix the data.)

Then my advisor told me that, since there aren't many reads, I could try an overlap graph assembler such as MIRA. However, the resulting assembly turned out worse: too many, shorter isotigs, fewer BUSCO hits, and 57.62% (3,575,408) of the reads excluded as "debris". Of this "debris", 82% (2,866,908) were excluded because of digital normalization.

The manifest file was:

project = G5
job = est,denovo,accurate
parameters = -NW:cmrnl=no
readgroup = G5_paired
data= paired_reads_1.fastq paired_reads_2.fastq
technology = solexa
template_size = 200 -1 exclusion_criterion autorefine
segment_placement = ---> <--- exclusion_criterion
segment_naming = solexa

Later, I read in another thread that one solution could be to run MIRA iteratively: use the generated isotigs and the excluded reads (the "debris") to run MIRA with a reference, and then do it again with that output. But I am not sure whether this would be correct. Also, since that thread is a bit old (2010), maybe the "EST" job settings for MIRA have improved in the meantime.

So, my questions are: What do you suggest I do? Is the MIRA approach reasonable, or should I drop it? If it is reasonable, is the iterative alternative correct? Also, could I somehow use the Trinity-assembled contigs with MIRA?

Thanks in advance!

Tags: rna-seq, mira, de novo
3.2 years ago, h.mon wrote:

If the aim of the work is to find SSRs and SNPs, you don't need the best possible assembly. For your goals, I would consider the best assembly to be the one with as little redundancy as possible. I can think of a number of ways to tackle this, but I don't know which is best.

1) you could feed your Trinity assembly to iAssembler - it will work better than cd-hit in reducing redundancy, in my experience

2) you could filter your Trinity assembly (longest isoforms and / or most expressed isoforms, or any other sensible way you can think of)

3) try building super-transcripts with Lace
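
As an illustration of option 2, here is a minimal Python sketch that keeps only the longest isoform per Trinity "gene". It assumes the usual Trinity transcript naming (for example `TRINITY_DN1000_c115_g5_i1`, where everything before the trailing `_i<N>` is the gene identifier). The function names `read_fasta` and `longest_isoform_per_gene` are made up for this example and are not part of Trinity or of the answer above. Note that the longest isoform is not necessarily the most expressed one, so this is only one of the "sensible ways" to filter.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: Trinity-style IDs such as TRINITY_DN1000_c115_g5_i1, where the
# part before the trailing _i<N> identifies the gene.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each gene ID to the (header, sequence) of its longest isoform."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in read_fasta(lines):
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        # Keep the first isoform seen for a gene unless a longer one appears.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best
```

To use it, open the Trinity FASTA file, pass the file handle to `longest_isoform_per_gene`, and write each returned header and sequence back out as FASTA.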
