Question: Transcriptome Assembly with rnaSPAdes
3.8 years ago by
biofalconch470 wrote:

Hello Everyone,

Very recently I completed a de novo assembly with rnaSPAdes (k-mer size 55) on a dataset of a little over 1 billion reads. However, I am getting a very large number of contigs (over 2 million). Here is what they look like in terms of length and coverage.

[Figure: Coverage Distribution]

[Figure: Length vs Coverage]

Is there a way to filter these out, or to get a reduced number of contigs (perhaps by changing the k-mer length)?

rna-seq • 2.4k views
ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by biofalconch470
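
One simple way to approach the filtering asked about above is to drop contigs that are both short and poorly covered. The sketch below is only an illustration: it assumes the contig headers follow the usual SPAdes naming pattern (`NODE_<id>_length_<len>_cov_<cov>`, optionally followed by gene/isoform suffixes), and the function names and the thresholds `min_length=300` and `min_cov=2.0` are hypothetical values, not recommendations from the thread.

```python
import re

# Assumed SPAdes-style header: NODE_<id>_length_<len>_cov_<cov>[_g<gene>_i<isoform>]
HEADER_RE = re.compile(r"length_(\d+)_cov_([0-9.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines, min_length=300, min_cov=2.0):
    """Yield contigs whose length and header coverage meet both thresholds.

    Length is measured on the sequence itself; coverage is parsed from the
    header. Contigs whose header does not match the assumed pattern are skipped.
    """
    for header, seq in read_fasta(lines):
        match = HEADER_RE.search(header)
        if match is None:
            continue
        coverage = float(match.group(2))
        if len(seq) >= min_length and coverage >= min_cov:
            yield header, seq
```

To use it on a real assembly, pass an open file handle (for example `open("transcripts.fasta")`, a hypothetical path) as `lines` and write the surviving records back out. The cutoffs are best chosen by looking at the length and coverage distributions first, since an aggressive coverage filter can also remove genuinely low-expressed transcripts.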

What is the expected genome size? What kind of data is this (Illumina, cycles, PE)? A billion reads may be overkill for a relatively small genome.

ADD REPLYlink written 3.8 years ago by genomax89k

Hello, it is RNA-seq data, Illumina PE, 101 nt. I am trying to assemble different conditions from the same experiment, since I didn't have the computational resources to do so in the past. As for the size, I was expecting maybe around 200,00 contigs.

ADD REPLYlink written 3.8 years ago by biofalconch470

As Genomax said, the expected genome size is very important in this case. The clade is also useful, as is the expected ploidy, the full set of preprocessing you did prior to assembly (such as contaminant removal), sample gathering/prep, etc. Basically, the more information, the better; for all I can tell right now, you might be producing an excellent assembly of a plant leaf meta-transcriptome.

ADD REPLYlink written 3.8 years ago by Brian Bushnell17k

Hello Brian, thanks for the reply. So as far as I know:

  • The genome is estimated to be 30 Gb, with big intergenic regions.
  • It is a salamander and it is diploid.
  • Preprocessing consisted of trimming Illumina adapters and sliding-window quality trimming to get rid of low-quality reads.
  • The reads were generated by poly-A capture.
ADD REPLYlink written 3.8 years ago by biofalconch470

It's always funny to me when some random "primitive" species has a genome size many times larger than human :) I've heard that there are amoebae with much larger genomes, as well (>10Gbp). Previously, the largest I'd heard of was the Loblolly pine with 22 Gbp, but this takes the cake. Go vertebrates!

So, on-topic: for diploid assemblies, large numbers of contigs are not necessarily unexpected. This would be easier if you had DNA data too. Different organisms have different heterozygosity rates, which, for different assemblers, yield varying numbers of contigs. Fungi, for example, can have 1/30 het rates, which wreak havoc with assemblers. Do you have an idea what the heterozygosity rate of your salamander is?

Also, there are some decontamination procedures that might be useful. And considering they're slimy... did you take any special precautions to remove skin-dwelling organisms? And have you done any sort of digital decontamination?

ADD REPLYlink written 3.8 years ago by Brian Bushnell17k

I guess it is always interesting how some organisms hold such big genomes (like Polychaos dubium, which holds 670 Gbp); what they use them for is anyone's guess.

I do not expect much heterozygosity, since the animals of this species used in the lab have been inbred for the last couple of decades (since 1890ish).

As far as I know, no decontamination procedures were followed, since the samples were obtained from embryos. I didn't perform any digital decontamination either.

ADD REPLYlink written 3.8 years ago by biofalconch470

You may want to go back to Trinity (if you have not done so). rnaSPAdes appears to be pretty new and a more established program may give you better results.

That said, hardware requirements for Trinity are stiff, and with such a large dataset you are bound to need hundreds of GB of RAM. Consider using Galaxy at Indiana if you don't have the resources available locally.

ADD REPLYlink written 3.8 years ago by genomax89k

Just a quick follow-up: I used BUSCO to look for orthologs in the SPAdes and Trinity assemblies. I found that around 1000 genes are missing in the SPAdes assembly compared to Trinity, but right now I am running a new assembly with a different k-mer size to see if this changes.

ADD REPLYlink written 3.8 years ago by biofalconch470
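
The comparison described above can be made concrete by listing which BUSCO ids are missing in one assembly but found in the other. The sketch below assumes each BUSCO run produced a tab-separated `full_table` file whose first two columns are the BUSCO id and its status (`Complete`, `Duplicated`, `Fragmented` or `Missing`), with comment lines starting with `#`. The function names are hypothetical and this is not how the numbers in the thread were obtained.

```python
def busco_ids_by_status(table_lines):
    """Group BUSCO ids by status from the lines of a full_table file."""
    by_status = {}
    for line in table_lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        busco_id, status = fields[0], fields[1]
        by_status.setdefault(status, set()).add(busco_id)
    return by_status


def missing_only_in(first_table, second_table):
    """Return BUSCO ids that are Missing in the first table but not in the second."""
    missing_first = busco_ids_by_status(first_table).get("Missing", set())
    missing_second = busco_ids_by_status(second_table).get("Missing", set())
    return missing_first - missing_second
```

Calling `missing_only_in(spades_lines, trinity_lines)` with the lines of the two tables gives the set of orthologs recovered by Trinity but absent from the rnaSPAdes assembly, which can then be tracked across assemblies made with different k-mer sizes.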

I noticed this was around 2.6 years ago. Has rnaSPAdes improved since then?

ADD REPLYlink written 13 months ago by O.rka210

I went through the changelogs and they seem to have improved the assembler, especially by implementing multi-k-mer assembly. However, I do not know whether this has a real impact on the assembly; it might be worth taking a look.

ADD REPLYlink written 13 months ago by biofalconch470