Question

Compare different RNASeq assemblers

2

Entering edit mode

8.6 years ago

Anton Shekhov ▴ 60

Hello community!

I just started with bioinformatics, so I hope you can help me with understanding some basic things. I have RNA-Seq 100bp pair-ended reads from the Lepidoptera insect, for which there is no genome available. In total there were 2 treatments and one control, so for the assembly I merged all of them into two files (R1 and R2). So the goal is to make an assembly, and then continue with mapping, counts and statistic.

For the assembly I used next tools: Trinity, Velvet/Oases, Bridger and CLC assembler. From the last one I just got the ready-to-go assembly, to which I compared other assemblers.

Trinity was executed with default parameters (k=25),
Velvet: velveth/velvetg k=21,51,2; For each assembly run oases and compared n50 and total number of contigs. Chose the range of contigs from 25 till 39, with kmer=35, velveth/velvetg on the oases results and then merged them with oases.
Bridger. Default parameters with kmers 25 and 27.

After I compared assemblers with Quast. As a result got the next table

                    Oases     Trinity   Bridger 25  Bridger 27  CLC
#contigs            10706     15579     14881       14632       19985
#contigs (>= 0 bp)  36576     40429     37011       36162       35589
Largest contig      27732     18062     27729       27887       40347
Total length        12451630  19701362  20030913    19559912    36090599
N50                 1324      1528      1653        1651        2584
# N's per 100 kbp   0         0         0.05        0.21        212.61
# N's               0         0         11          41          76733

So, I could not reach the N50 of CLC, by using default parameters. But at least I can see how do they reach it. The number of mismatches in CLC is super high, so I was wondering - is it really worth to increase the total length and n50 by putting double amount of N's into the contigs?

And the next question - for some reason I though it would be a good idea to increase the input sequence length by merging R1 and R1, R2 and R2 reads. So basically doubling the amount of data assembler should deal with. So far I tested only Bridger (k=25), but the results are surprising, N50 increased to 1720, and the total length and the largest contigs became almost the same as in CLC results. What I think is happened that during the building of the graph low quality k-mers had a better chance to survive, thus increasing the contig length.

Thanks in advance for any kind of responds!

next-gen Trinity Assembly RNA-Seq Bridger • 5.1k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.6 years ago by Anton Shekhov ▴ 60

3

Entering edit mode

Increasing N50 is not always good, and this is particularly true for transcriptome assemblies. Similarly, it is not always helpful to feed more data to the assemblers - in case you want to use several fastq, consider performing digital normalization.

A good way of evaluating transcriptome assemblers is using Transrate - it maps the reads back to the assemblies and calculate some statistics to rank the assemblies.

edit: are you using QUAST or rnaQUAST?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by h.mon 35k

0

Entering edit mode

Thanks for the answer, and for the tip with Transrate - I will definitely try it. I used QUAST, as far as I understood it just going through the whole assembly and calculating statistic without mapping counts to it. Some kind of raw statistical measurment

ADD REPLY • link 8.6 years ago by Anton Shekhov ▴ 60

0

Entering edit mode

Yeah, quast is just basic contig statistics, make sure you use --scaffolds if it contains contigs combined with N's

ADD REPLY • link 8.6 years ago by moorem ▴ 240

Ram · Accepted Answer · 2015-10-09

6

Entering edit mode

8.6 years ago

Istvan Albert 100k

Assembly is somewhat of a "dark art" and that, as I found, means trying out many different methods/parameters until one seems to work. ... not the most scientific approach though. Merging the data if it is from different measurements and conditions should be just fine, even recommended. It is possible that some transcripts get only expressed under one of the conditions.

Fundamentally the problem is that N50 is not a good measure of assembly quality (although it usually correlates with it) and is the only one you can easily compute right away.

What I would recommend is to align assemblies against each other. You may find good agreement for some transcripts, these are more likely to be correct, then many disagreements and you can perhaps find some patterns.

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.6 years ago by Istvan Albert 100k

1

Entering edit mode

One metric you could try is assessing presence and abundance of single-copy orthologs (SCO, using BUSCO) on the assembly. BUSCO has a genome and transcriptome mode, and (in our case at least) has helped indicate completeness for genome assembly and point out potential redundancy issues, e.g. duplicated complete or fragmented SCO's.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Chris Fields ★ 2.2k

0

Entering edit mode

Thank you! It will be the next think for me to try then.

But as for the CLC assembler, I still did not get the idea why do they put so many mismatches into the contigs? Is it a normal thing to do? I thought during the reads alignment to such contigs it would mess up results, cause there are to many possibilities to match the same contig with N's inside.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.5 years ago by Anton Shekhov ▴ 60

0

Entering edit mode

I have never fully understood the internal decision making in any assembler I have ever tried. I often look at assembled contigs and keep scratching my head - it is so clear that two of them should have been merged but they aren't. Many do work well and produce neat results out of complex data - but then mysteriously fail on some seemingly trivially easy to resolve situations. Then you do those by eye...

ADD REPLY • link updated 20 months ago by Ram 43k • written 8.5 years ago by Istvan Albert 100k

Ram · Accepted Answer · 2015-10-14

Mapping back the reads is definitely one of the good way to evaluate the assembly. Apart from it, you can try:

Comparing the assemblies by performing BLAST them against nearest species full-length cdna's / proteins sequences obtained from RefSeq and check how many your transcriptome picked up.
Finding full-length sequences in your assembly (using something like full-lengther_next)
Finding Open Reading Frames in each of your assemblies.