Question

De novo transcriptome assembly of a small number of Illumina reads with MIRA


6.9 years ago

Andrés Ribone ▴ 60

Hi,

I'm working with paired-end Illumina RNA-seq data from two varieties of a non-model plant (no reference sequence is available). The aim of the work is to find SSRs and SNPs.

The dataset is small: I have only 6 million reads (120 bp) per strain, with one individual sequenced per strain. I started by assembling the data with Trinity, but the resulting assemblies weren't very good. I ran a separate assembly for each strain (I didn't pool the data).

Then my advisor told me that, since there aren't many reads, I could try an overlap graph assembler such as MIRA. However, the resulting assembly turned out worse: too many short isotigs, fewer BUSCO hits, and 57.62% (3,575,408) of the reads excluded as "debris". Of this debris, 82% (2,866,908 reads) was excluded because of digital normalization.

The manifest file was:

project = G5
job = est,denovo,accurate
parameters = -NW:cmrnl=no
readgroup = G5_paired
data = paired_reads_1.fastq paired_reads_2.fastq
technology = solexa
template_size = 200 -1 exclusion_criterion autorefine
segment_placement = ---> <--- exclusion_criterion
segment_naming = solexa

Later, I read here http://seqanswers.com/forums/showthread.php?t=8210 that one solution could be to run MIRA iteratively: use the generated isotigs and the excluded reads (the "debris") to run MIRA with a reference, and then do the same again with that output. But I am not sure whether this would be correct. Also, since that thread is a bit old (2010), the "EST" job settings for MIRA may have been improved in the meantime.
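For the iterative idea, a practical first step is pulling the debris reads back out of the original FASTQ files. Below is a minimal Python sketch of that step only. It assumes the debris read names are available in a plain text file with the read name in the first whitespace-separated column, and that the FASTQ uses strict 4-line records. The function names are hypothetical and not part of MIRA.

```python
def load_read_names(lines):
    """Collect read names from an iterable of text lines.

    Assumption: one read per line, name in the first column;
    blank lines and lines starting with '#' are skipped.
    """
    names = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split()[0])
    return names


def filter_fastq(lines, wanted):
    """Yield FASTQ records (lists of 4 strings) whose read name is in `wanted`.

    Assumption: every record is exactly 4 lines (no wrapped sequences).
    Names are matched exactly, so a '/1' or '/2' suffix must be present
    in both the FASTQ header and the name list.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            name = record[0][1:].split()[0]  # drop '@', keep the read ID
            if name in wanted:
                yield record
            record = []
```

To use it, open the name list and each FASTQ file, pass the file handles as `lines`, and write each yielded record back out joined with newlines. Note that filtering the two mate files independently can leave reads without their mate, so pairing would need to be checked before a second MIRA run.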

So, my questions are: What do you suggest I do? Is the MIRA approach reasonable, or should I drop it? If it is worth keeping, is the iterative alternative correct? Also, could I somehow use the Trinity-assembled contigs with MIRA?

Thanks in advance!

RNA-Seq MIRA de novo • 1.7k views

updated 6.9 years ago by h.mon 35k • written 6.9 years ago by Andrés Ribone ▴ 60

score 1 · Answer 1 · 2017-05-24

If the aim of the work is to find SSRs and SNPs, you don't need the best possible assembly. For your goals, I would consider the best assembly to be the one with as little redundancy as possible. I can think of a number of ways to tackle this, but I don't know which is best.

1) you could feed your Trinity assembly to iAssembler; in my experience, it will work better than CD-HIT at reducing redundancy

2) you could filter your Trinity assembly (keep the longest and/or most expressed isoforms, or use any other sensible criterion you can think of)

3) try building super-transcripts with Lace
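As an illustration of option 2, here is a minimal Python sketch that keeps only the longest isoform per Trinity gene. It assumes the FASTA headers follow the Trinity naming pattern `TRINITY_DN1000_c115_g5_i1`, where everything before the trailing `_i<number>` identifies the gene; older Trinity versions name transcripts differently, so the pattern may need adjusting. The function names are hypothetical.

```python
import re

# Assumption: transcript IDs end in _i<number>; stripping that suffix
# gives the Trinity "gene" ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID (text up to the first whitespace).
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(records):
    """Return {gene_id: (transcript_id, sequence)} with the longest isoform.

    Ties are resolved in favour of the isoform seen first.
    """
    best = {}
    for name, seq in records:
        gene = ISOFORM_SUFFIX.sub("", name)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return best
```

The longest isoform is not necessarily the most expressed one, so selecting by expression instead would need abundance estimates as an extra input.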