Question: Compare different RNASeq assemblers
5.3 years ago by Anton Shekhov60 (Germany/Jena/MPI ICE) wrote:

Hello community!

I just started with bioinformatics, so I hope you can help me with understanding some basic things. I have 100 bp paired-end RNA-Seq reads from a Lepidoptera insect for which there is no genome available. In total there were two treatments and one control, so for the assembly I merged all of them into two files (R1 and R2). The goal is to make an assembly and then continue with mapping, read counting and statistics.

For the assembly I used the following tools: Trinity, Velvet/Oases, Bridger and the CLC assembler. The last one gave me a ready-to-go assembly, against which I compared the other assemblers.

1. Trinity was executed with default parameters (k=25).
2. Velvet: velveth/velvetg with k=21,51,2 (k from 21 to 51 in steps of 2). For each assembly I ran Oases and compared the N50 and the total number of contigs. I kept the assemblies in the k range from 25 to 39, ran velveth/velvetg (k=35) on the Oases results and then merged them with Oases.
3. Bridger: default parameters with k-mers 25 and 27.

Afterwards I compared the assemblers with QUAST and got the following table:

|                     | Oases    | Trinity  | Bridger 25 | Bridger 27 | CLC      |
|---------------------|----------|----------|------------|------------|----------|
| # contigs           | 10706    | 15579    | 14881      | 14632      | 19985    |
| # contigs (>= 0 bp) | 36576    | 40429    | 37011      | 36162      | 35589    |
| Largest contig      | 27732    | 18062    | 27729      | 27887      | 40347    |
| Total length        | 12451630 | 19701362 | 20030913   | 19559912   | 36090599 |
| N50                 | 1324     | 1528     | 1653       | 1651       | 2584     |
| # N's per 100 kbp   | 0        | 0        | 0.05       | 0.21       | 212.61   |
| # N's               | 0        | 0        | 11         | 41         | 76733    |
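Since the whole comparison hinges on N50, here is a minimal sketch of how QUAST-style tools compute it (sort contig lengths descending, walk the cumulative sum until half the total assembly length is reached):

```python
def n50(lengths):
    """N50: the length of the contig at which the running total of
    sorted contig lengths first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            return length
    return 0

# Toy assembly: 2900 bp in total, so half is 1450; the running sum
# reaches it at the 800 bp contig (1000 + 800 = 1800).
print(n50([1000, 800, 500, 400, 200]))  # 800
```

Note that a few long contigs (or contigs inflated with N's) can pull N50 up sharply, which is exactly why the CLC number looks so good here.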

So I could not reach the N50 of CLC using default parameters, but at least I can see how they reach it. The number of N's in CLC is extremely high, so I was wondering: is it really worth increasing the total length and N50 by padding the contigs with that many N's?

And the next question: for some reason I thought it would be a good idea to increase the amount of input sequence by concatenating R1 with R1 and R2 with R2, basically doubling the amount of data the assembler has to deal with. So far I have tested only Bridger (k=25), but the results are surprising: N50 increased to 1720, and the total length and the largest contig became almost the same as in the CLC results. What I think happened is that during graph construction low-quality k-mers had a better chance to survive, thus increasing the contig length.
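That guess can be illustrated with a toy model. Assemblers typically discard k-mers below some coverage cutoff; concatenating the same reads twice doubles every k-mer count, so erroneous k-mers that sat just below a fixed cutoff suddenly pass it. The cutoff and k below are arbitrary illustration values, not Bridger's actual settings:

```python
from collections import Counter

def kmers(read, k=5):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def surviving_kmers(reads, cutoff=2):
    """Keep k-mers seen at least `cutoff` times -- a crude stand-in
    for an assembler's k-mer coverage filter."""
    counts = Counter(km for r in reads for km in kmers(r))
    return {km for km, c in counts.items() if c >= cutoff}

# Two copies of a "true" read plus one read with a sequencing error.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"]
before = surviving_kmers(reads)      # error k-mers (count 1) are filtered out
after = surviving_kmers(reads * 2)   # doubled data: every count doubles,
                                     # so the error k-mers now pass the cutoff
print("TACGA" in before, "TACGA" in after)  # False True
```

With a fixed absolute cutoff, duplicating the input admits error k-mers without adding any new information, which matches the inflated-but-not-better assembly you observed.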

Thanks in advance for any kind of response!

modified 5.3 years ago by Prakki Rama2.4k • written 5.3 years ago by Anton Shekhov60

Increasing N50 is not always good, and this is particularly true for transcriptome assemblies. Similarly, it is not always helpful to feed more data to the assemblers - if you want to use several FASTQ files, consider performing digital normalization.
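For reference, digital normalization streams the reads and keeps a read only if the median abundance of its k-mers (among the reads kept so far) is still below a target coverage. A minimal sketch, assuming exact k-mer counting (real tools such as khmer's normalize-by-median.py use probabilistic counting and different defaults):

```python
from collections import Counter
from statistics import median

def digital_normalization(reads, k=5, target=3):
    """Keep a read only if the median count of its k-mers, among reads
    kept so far, is below `target` coverage. k and target are
    illustration values, not recommended settings."""
    counts = Counter()
    kept = []
    for read in reads:
        kms = [read[i:i + k] for i in range(len(read) - k + 1)]
        if median(counts[km] for km in kms) < target:
            kept.append(read)
            counts.update(kms)
    return kept

# Ten identical reads collapse down to roughly the target coverage.
print(len(digital_normalization(["ACGTACGTAC"] * 10)))  # 2
```

The point is that redundant high-coverage reads are discarded while rare (possibly condition-specific) transcripts are kept, so merging many libraries stops blowing up the assembler's memory and graph complexity.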

A good way of evaluating transcriptome assemblies is using Transrate - it maps the reads back to the assemblies and calculates some statistics to rank them.

edit: are you using QUAST or rnaQUAST?

Thanks for the answer, and for the tip about Transrate - I will definitely try it. I used QUAST; as far as I understood, it just goes through the whole assembly and calculates statistics without mapping reads back to it - some kind of raw statistical measurement.

Yeah, QUAST is just basic contig statistics. Make sure you use --scaffolds if the assembly contains contigs joined with N's.
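The --scaffolds option makes QUAST break scaffolds into contigs at long runs of N before computing statistics; a rough sketch of that splitting step (the minimum N-run length of 10 is an assumption here, not necessarily QUAST's exact default):

```python
import re

def split_scaffold(seq, min_n_run=10):
    """Split a scaffold into contigs at runs of >= min_n_run N's,
    dropping the N gaps themselves (short internal N runs survive)."""
    return [part for part in re.split("N{%d,}" % min_n_run, seq.upper()) if part]

# A 20 bp contig and an 18 bp contig joined by a 12 N gap.
scaffold = "ACGT" * 5 + "N" * 12 + "TTGACC" * 3
print([len(c) for c in split_scaffold(scaffold)])  # [20, 18]
```

After this split, the N-padded CLC assembly would be judged on its actual contigs rather than on inflated scaffold lengths.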

5.3 years ago by Istvan Albert ♦♦ 86k (University Park, USA) wrote:

Assembly is somewhat of a "dark art", and that, as I found, means trying out many different methods/parameters until one seems to work ... not the most scientific approach, though. Merging the data, even if it comes from different measurements and conditions, should be just fine - even recommended. It is possible that some transcripts are only expressed under one of the conditions.

Fundamentally the problem is that N50 is not a good measure of assembly quality (although it usually correlates with it), yet it is the only one you can easily compute right away.

What I would recommend is to align the assemblies against each other. You may find good agreement for some transcripts - these are more likely to be correct - then many disagreements, and you can perhaps find some patterns.
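One crude way to sketch such an all-vs-all comparison without running BLAST or nucmer is k-mer Jaccard similarity between contigs; a toy illustration (the contig names, k and threshold below are arbitrary):

```python
def kmer_set(seq, k=7):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def best_matches(assembly_a, assembly_b, k=7, min_jaccard=0.5):
    """For each contig in assembly A, report the contig in assembly B
    with the highest k-mer Jaccard similarity, if it clears a
    threshold. A crude stand-in for an all-vs-all BLAST/nucmer run."""
    pairs = {}
    for name_a, seq_a in assembly_a.items():
        ka = kmer_set(seq_a, k)
        best, best_j = None, 0.0
        for name_b, seq_b in assembly_b.items():
            kb = kmer_set(seq_b, k)
            union = ka | kb
            j = len(ka & kb) / len(union) if union else 0.0
            if j > best_j:
                best, best_j = name_b, j
        if best_j >= min_jaccard:
            pairs[name_a] = (best, round(best_j, 2))
    return pairs

a = {"t1": "ACGTACGTACGTACGT", "t2": "TTTTGGGGCCCCAAAA"}
b = {"x1": "ACGTACGTACGTACGT"}
print(best_matches(a, b))  # {'t1': ('x1', 1.0)} -- t2 has no good match
```

Transcripts recovered near-identically by independent assemblers (high similarity here) are the ones you can trust first.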


One metric you could try is assessing the presence and abundance of single-copy orthologs (SCOs) using BUSCO. BUSCO has a genome and a transcriptome mode, and (in our case at least) it has helped indicate completeness for a genome assembly and point out potential redundancy issues, e.g. duplicated complete or fragmented SCOs.
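If you go that route, the BUSCO per-SCO table can be summarized with a few lines; the sketch below assumes a full_table.tsv-style layout (tab-separated, SCO id in the first column, status in the second, comment lines starting with '#'), so check it against your BUSCO version's actual output:

```python
from collections import Counter

def busco_summary(lines):
    """Tally BUSCO statuses from a full_table-style file (layout
    assumed, see above). Duplicated SCOs appear on several lines,
    so each SCO id is counted once."""
    status_by_id = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        busco_id, status = line.split("\t")[:2]
        status_by_id.setdefault(busco_id, status)
    return Counter(status_by_id.values())

# Toy table in the assumed format.
table = [
    "# Busco id\tStatus\tSequence",
    "EOG001\tComplete\tcontig_5",
    "EOG002\tDuplicated\tcontig_1",
    "EOG002\tDuplicated\tcontig_9",
    "EOG003\tMissing",
]
print(busco_summary(table))  # one Complete, one Duplicated, one Missing
```

A high Duplicated count on a transcriptome assembly can simply mean isoforms or redundant contigs, so interpret it differently than for a genome.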

Thank you! That will be the next thing for me to try then.

But as for the CLC assembler, I still did not get why they put so many N's into the contigs. Is that a normal thing to do? I thought that during read alignment to such contigs it would mess up the results, because there are too many possibilities to match the same contig with N's inside.
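The concern is easy to make concrete: if an aligner treated N as a wildcard, a read could match an N-padded contig at many positions. A toy illustration (real aligners handle N in more nuanced ways, so this only sketches the ambiguity):

```python
def match_positions(contig, read):
    """Start positions where `read` matches `contig` exactly, treating
    N in the contig as a wildcard that matches any base."""
    hits = []
    for i in range(len(contig) - len(read) + 1):
        window = contig[i:i + len(read)]
        if all(c == r or c == "N" for c, r in zip(window, read)):
            hits.append(i)
    return hits

clean  = "ACGTACGTACGT"
padded = "ACNNNNNNNNGT"
print(match_positions(clean, "ACGT"))   # [0, 4, 8] -- unambiguous
print(match_positions(padded, "ACGT"))  # 7 positions, mostly spurious
```

In practice the N's mark gaps between paired-end-linked contigs rather than real sequence, and reads falling entirely inside an N run simply cannot be placed confidently.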

I have never fully understood the internal decision making in any assembler I have ever tried. I often look at assembled contigs and keep scratching my head - it is so clear that two of them should have been merged, but they aren't. Many do work well and produce neat results out of complex data - but then mysteriously fail on some seemingly trivial, easy-to-resolve situations. Then you do those by eye...

5.3 years ago by Prakki Rama2.4k (Singapore) wrote:

Mapping the reads back is definitely one of the good ways to evaluate an assembly. Apart from that, you can try:

1. Comparing the assemblies by BLASTing them against full-length cDNA/protein sequences of the nearest species obtained from RefSeq, and checking how many of them your transcriptome picked up.
2. Finding full-length sequences in your assembly (using something like full-lengther_next).
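For point 2, the core of such a full-length check is just the fraction of the best reference protein covered by each transcript's hit; a minimal sketch (the 90% cutoff and the input layout are assumptions for illustration, not full-lengther_next's actual behavior):

```python
def classify_hits(hits, full_cutoff=0.9):
    """Call a transcript putatively full-length when its best hit
    covers at least `full_cutoff` of the reference protein. `hits`
    maps transcript id -> (aligned reference residues, reference
    length); both the cutoff and this layout are assumptions."""
    calls = {}
    for tx, (aligned, ref_len) in hits.items():
        coverage = aligned / ref_len
        calls[tx] = "full-length" if coverage >= full_cutoff else "partial"
    return calls

# Hypothetical BLASTX-style hit summaries for two transcripts.
print(classify_hits({"t1": (450, 460), "t2": (120, 460)}))
# {'t1': 'full-length', 't2': 'partial'}
```

Counting how many transcripts come out as full-length gives a completeness measure that, unlike N50, actually reflects transcript reconstruction quality.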