Question

Transcriptome Assembly with rnaSPAdes

1

7.4 years ago

biofalconch ★ 1.1k

Hello Everyone,

Very recently I completed a de novo assembly with rnaSPAdes (k-mer size 55) on a dataset of just over 1 billion reads. However, I am getting a lot of contigs (over 2 million). Here is what they look like in terms of length and coverage.

[Figure: coverage distribution]

[Figure: length vs. coverage]

Is there a way to filter these out, or some other way to reduce the number of contigs (for example, by changing the k-mer length)?
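For reference, one simple way to filter on length and coverage is sketched below. It is only a sketch, and it assumes the contig headers follow the usual SPAdes naming pattern (something like `NODE_1_length_1234_cov_12.34_g0_i0`), where `cov` is a k-mer coverage estimate rather than read depth. The function names and the thresholds (`min_length=200`, `min_cov=2.0`) are illustrative placeholders, not recommended values; check the headers in your own file before relying on the regular expression.

```python
import re
from typing import Iterable, Iterator, Tuple

# Assumed header pattern: ..._length_<int>_cov_<float>_...
HEADER_RE = re.compile(r"length_(\d+)_cov_([0-9.]+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def filter_contigs(
    records: Iterable[Tuple[str, str]],
    min_length: int = 200,   # hypothetical threshold
    min_cov: float = 2.0,    # hypothetical threshold
) -> Iterator[Tuple[str, str]]:
    """Keep contigs whose header reports enough length and coverage."""
    for header, seq in records:
        match = HEADER_RE.search(header)
        if match is None:
            # Header does not follow the assumed pattern; skip it.
            continue
        length, cov = int(match.group(1)), float(match.group(2))
        if length >= min_length and cov >= min_cov:
            yield header, seq


if __name__ == "__main__":
    example = [
        ">NODE_1_length_500_cov_10.5_g0_i0", "ACGT",
        ">NODE_2_length_150_cov_30.0_g1_i0", "AC",
        ">NODE_3_length_900_cov_1.2_g2_i0", "GG",
    ]
    for header, seq in filter_contigs(read_fasta(example)):
        print(f">{header}\n{seq}")
```

To run it on a real assembly, pass an open file handle to `read_fasta` in place of the `example` list. A hard cutoff like this will also discard genuinely low-expressed transcripts, so it is worth comparing completeness before and after filtering.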

RNA-Seq • 4.4k views

ADD COMMENT • link 7.4 years ago by biofalconch ★ 1.1k

3

What is the expected genome size? What kind of data is this (Illumina, cycles, PE)? A billion reads may be overkill for a relatively small genome.

ADD REPLY • link 7.4 years ago by GenoMax 141k

0

Hello, it is RNA-seq data, Illumina PE, 101 nt. I am trying to assemble different conditions from the same experiment, since I didn't have the computational resources to do so in the past. As for the size, I was expecting maybe around 200,00 contigs.

ADD REPLY • link 7.4 years ago by biofalconch ★ 1.1k

1

As Genomax said, the expected genome size is very important in this case. The clade is also useful, as is the expected ploidy, the full set of preprocessing you did prior to assembly (such as contaminant removal), sample gathering/prep, etc. Basically, the more information, the better; for all I can tell right now, you might be producing an excellent assembly of a plant leaf meta-transcriptome.

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

Hello Brian, thanks for the reply. So as far as I know:

- The genome is estimated to be 30 Gb, with big intergenic regions.
- It is a salamander, and it is diploid.
- Preprocessing consisted of trimming Illumina adapters and applying a sliding window to get rid of low-quality reads.
- The reads were generated by poly-A capture.

ADD REPLY • link 7.4 years ago by biofalconch ★ 1.1k

0

It's always funny to me when some random "primitive" species has a genome size many times larger than human :) I've heard that there are amoebae with much larger genomes, as well (>10Gbp). Previously, the largest I'd heard of was the Loblolly pine with 22 Gbp, but this takes the cake. Go vertebrates!

So, on-topic: for diploid assemblies, large numbers of contigs are not necessarily unexpected. This would be easier if you had DNA data too. Different organisms have different heterozygosity rates, which, for different assemblers, yield varying numbers of contigs. Fungi, for example, can have 1/30 het rates, which wreak havoc with assemblers. Do you have an idea what the heterozygosity rate of your salamander is?

Also, there are some decontamination procedures that might be useful. And considering they're slimy... did you take any special precautions to remove skin-dwelling organisms? And have you done any sort of digital decontamination?

ADD REPLY • link 7.4 years ago by Brian Bushnell 20k

0

I guess it is always interesting how some organisms hold such big genomes (just like Polychaos dubium, which holds 670 Gbp); what they use it for is anyone's guess.

I do not expect much heterozygosity, since the organisms used in the lab for this species have been inbred for the last couple decades (since 1890ish).

As far as I know, no decontamination procedures were followed, since the samples were obtained from embryos. I didn't perform any digital decontamination either.

ADD REPLY • link 7.4 years ago by biofalconch ★ 1.1k

1

You may want to go back to Trinity (if you have not done so). rnaSPAdes appears to be pretty new and a more established program may give you better results.

That said, hardware requirements for Trinity are stiff, and with such a large dataset you are bound to need hundreds of GB of RAM. Consider using Galaxy at Indiana if you don't have the resources available locally.

ADD REPLY • link 7.4 years ago by GenoMax 141k

1

Just a quick follow-up: I used BUSCO to look for orthologs in the SPAdes and Trinity assemblies. I found that around 1000 genes are missing in the SPAdes assembly compared to Trinity, but right now I am running a new assembly with a different k-mer size to see if this changes.

ADD REPLY • link 7.4 years ago by biofalconch ★ 1.1k

0

I noticed this was around 2.6 years ago. Has rnaSPAdes improved since then?

ADD REPLY • link 4.7 years ago by O.rka ▴ 710

0

I went through the changelogs, and they seem to have improved the assembler, especially by implementing multi-k-mer assembly. However, I do not know whether this has a real impact on the assembly; it might be worth a look.

ADD REPLY • link 4.7 years ago by biofalconch ★ 1.1k