Hello community!
I just started with bioinformatics, so I hope you can help me understand some basic things. I have 100 bp paired-end RNA-Seq reads from a lepidopteran insect for which no genome is available. In total there were two treatments and one control, so for the assembly I merged all of them into two files (R1 and R2). The goal is to build an assembly and then continue with mapping, counting and statistics.
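The merging itself was plain concatenation, keeping R1 and R2 separate, roughly like this (the file names are just placeholders for my actual samples):

    # concatenate control and both treatments; use the same file order for R1 and R2
    # so that the read pairing is preserved
    cat control_R1.fastq treat1_R1.fastq treat2_R1.fastq > all_R1.fastq
    cat control_R2.fastq treat1_R2.fastq treat2_R2.fastq > all_R2.fastq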
For the assembly I used the following tools: Trinity, Velvet/Oases, Bridger and the CLC assembler. From the last one I just received a ready-to-go assembly, against which I compared the other assemblers. Example invocations are sketched after the list below.
- Trinity: executed with default parameters (k=25).
- Velvet/Oases: velveth/velvetg with k = 21,51,2; for each k I ran Oases and compared the N50 and the total number of contigs. I then kept the assemblies from k = 25 to 39 and, with k-mer = 35, ran velveth/velvetg on the Oases transcripts and merged them with Oases.
- Bridger: default parameters, with k-mers 25 and 27.
 
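Roughly the commands I used, from memory; paths, thread and memory settings are placeholders:

    # Trinity, default k-mer size of 25
    Trinity --seqType fq --left all_R1.fastq --right all_R2.fastq \
            --CPU 8 --max_memory 50G --output trinity_out

    # Velvet/Oases: one assembly per k from 21 to 51 (step 2)
    # (velvet needs to be compiled with MAXKMERLENGTH >= 51 for this range)
    velveth velvet 21,51,2 -shortPaired -fastq -separate all_R1.fastq all_R2.fastq
    for d in velvet_*; do
        velvetg "$d" -read_trkg yes
        oases   "$d"
    done
    # merge the per-k transcripts (here k = 25..39) at k-mer 35
    velveth merged 35 -long velvet_{25,27,29,31,33,35,37,39}/transcripts.fa
    velvetg merged -read_trkg yes -conserveLong yes
    oases   merged -merge yes

    # Bridger, run once with k=25 and once with k=27
    perl Bridger.pl --seqType fq --left all_R1.fastq --right all_R2.fastq \
         --kmer_length 25 --CPU 8 --output bridger_k25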
Afterwards I compared the assemblies with QUAST (the call is sketched below the table) and got the following result:
                    Oases     Trinity   Bridger 25  Bridger 27  CLC
# contigs           10706     15579     14881       14632       19985
# contigs (>= 0 bp) 36576     40429     37011       36162       35589
Largest contig      27732     18062     27729       27887       40347
Total length        12451630  19701362  20030913    19559912    36090599
N50                 1324      1528      1653        1651        2584
# N's per 100 kbp   0         0         0.05        0.21        212.61
# N's               0         0         11          41          76733
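For reference, the comparison was a plain QUAST run over the five assemblies, something like this (file names are placeholders):

    # contigs shorter than --min-contig (500 bp, the default) only count in the '>= 0 bp' rows
    quast.py oases.fasta trinity.fasta bridger_k25.fasta bridger_k27.fasta clc.fasta \
             -o quast_comparison --min-contig 500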
So I could not reach the N50 of CLC using default parameters, but at least I can see how they reach it: the number of N's in the CLC assembly is very high. So I was wondering - is it really worth it to inflate the total length and N50 by joining contigs with that many N's?
And the next question: for some reason I thought it would be a good idea to increase the amount of input sequence by concatenating R1 with R1 and R2 with R2, i.e. simply feeding the assembler every read twice. So far I have only tested Bridger (k=25), but the results are surprising: N50 increased to 1720, and the total length and the largest contig became almost the same as in the CLC results. What I think happened is that, during graph construction, low-quality k-mers had a better chance to survive, thus increasing the contig length.
Thanks in advance for any kind of response!
Increasing N50 is not always good, and this is particularly true for transcriptome assemblies. Similarly, it is not always helpful to feed more data to the assemblers - if you want to use several FASTQ files, consider performing digital normalization first.
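A minimal sketch with khmer, assuming the khmer scripts are installed; k and the coverage cutoff are just common starting values, not a recommendation tuned for your data:

    # interleave the pairs, then normalize to a median k-mer coverage of ~20
    interleave-reads.py all_R1.fastq all_R2.fastq -o interleaved.fastq
    normalize-by-median.py -p -k 20 -C 20 interleaved.fastq -o normalized.fastq

If I remember correctly, recent Trinity versions can also do this in silico normalization for you via --normalize_reads.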
A good way of evaluating transcriptome assemblies is Transrate - it maps the reads back to each assembly and calculates statistics to rank the assemblies.
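Something along these lines (paths are placeholders):

    transrate --assembly trinity_out/Trinity.fasta \
              --left all_R1.fastq --right all_R2.fastq \
              --threads 8 --output transrate_trinity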
edit: are you using QUAST or rnaQUAST?
Thanks for the answer, and for the tip about Transrate - I will definitely try it. I used QUAST; as far as I understand, it just goes through the whole assembly and calculates statistics without mapping any reads to it - some kind of raw statistical measurement.
Yeah, QUAST gives just basic contig statistics; make sure you use --scaffolds if the assembly contains contigs joined with N's.
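For example, for the CLC assembly (file name is a placeholder) something like:

    # with --scaffolds QUAST also reports statistics for the scaffolds broken at N stretches
    quast.py clc.fasta --scaffolds -o quast_clc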