I have 8 replicates of Illumina paired-end reads from a fungal RNA-seq experiment, which I assembled de novo with Trinity. The Trinity documentation recommends the --jaccard_clip option when you expect high gene density, which may be the case for a small fungal genome.
I assembled the transcriptome twice, once with --jaccard_clip and once without, and performed a couple of the recommended quality assessment steps on each.
Read representation is good for both assemblies: 99.12% of reads map to the transcriptome built with --jaccard_clip, and 98.97% to the one built without it.
Below are the stats for each transcriptome, generated with TrinityStats.pl.
With --jaccard_clip
Counts of transcripts, etc.
Total trinity 'genes': 18674
Total trinity transcripts: 30205
Percent GC: 62.50
Stats based on ALL transcript contigs:
Contig N50: 2481
Median contig length: 674
Average contig: 1296.03
Total assembled bases: 39146732
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 2336
Median contig length: 363
Average contig: 1034.59
Total assembled bases: 19319873
Without --jaccard_clip
Counts of transcripts, etc.
Total trinity 'genes': 6106
Total trinity transcripts: 18773
Percent GC: 62.38
Stats based on ALL transcript contigs:
Contig N50: 4196
Median contig length: 2074
Average contig: 2720.12
Total assembled bases: 51064844
Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 3986
Median contig length: 1973.5
Average contig: 2514.36
Total assembled bases: 15352683
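For reference, the N50, median, and average figures above are all simple functions of the list of contig lengths. Below is a minimal Python sketch of how they could be recomputed from any list of lengths. The function names `n50` and `summarize` are illustrative and are not part of Trinity, and this is not TrinityStats.pl's own code.

```python
from statistics import mean, median


def n50(lengths):
    """Return the contig length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_of_bases:
            return length
    raise ValueError("no contig lengths given")


def summarize(lengths):
    """Collect the same kind of summary values reported in the stats above."""
    return {
        "n50": n50(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "total_bases": sum(lengths),
    }
```

Because N50 is weighted by length, it also rises when neighbouring transcripts are fused into one contig, so a larger N50 is not by itself evidence of a better transcriptome assembly.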
Trinity also recommends counting full-length transcripts by BLASTing against SwissProt. Below, transcripts were aligned to their best protein hit. The table shows the number of hits at various percent-coverage levels. For more info on this table, see https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts
With --jaccard_clip
hit_pct_cov_bin  count_in_bin  >bin_below
100              1046          1046
90               571           1617
80               404           2021
70               331           2352
60               327           2679
50               330           3009
40               305           3314
30               231           3545
20               281           3826
10               120           3946
Without --jaccard_clip
hit_pct_cov_bin  count_in_bin  >bin_below
100              2313          2313
90               764           3077
80               552           3629
70               440           4069
60               381           4450
50               333           4783
40               314           5097
30               269           5366
20               220           5586
10               102           5688
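To compare the two tables directly, it can help to recompute the running totals and pull out the top bins. The sketch below is Python with the counts copied from the tables above; the helper names are made up for illustration. It assumes, as I read the Trinity wiki page linked above, that each bin label is the upper edge of a 10% coverage range, so the 100 and 90 bins together correspond to hits with more than 80% coverage.

```python
# (bin label, count_in_bin) pairs copied from the two tables above
WITH_CLIP = [(100, 1046), (90, 571), (80, 404), (70, 331), (60, 327),
             (50, 330), (40, 305), (30, 231), (20, 281), (10, 120)]
WITHOUT_CLIP = [(100, 2313), (90, 764), (80, 552), (70, 440), (60, 381),
                (50, 333), (40, 314), (30, 269), (20, 220), (10, 102)]


def running_totals(bins):
    """Rebuild the third column: cumulative count from the top bin down."""
    totals = []
    running = 0
    for _label, count in bins:
        running += count
        totals.append(running)
    return totals


def hits_in_bins_at_or_above(bins, min_label):
    """Sum the counts of every bin whose label is >= min_label."""
    return sum(count for label, count in bins if label >= min_label)


if __name__ == "__main__":
    for name, bins in (("with --jaccard_clip", WITH_CLIP),
                       ("without --jaccard_clip", WITHOUT_CLIP)):
        top = hits_in_bins_at_or_above(bins, 90)
        total = running_totals(bins)[-1]
        print(f"{name}: {top} hits in the 90 and 100 bins, {total} hits overall")
```

On these numbers the 90 and 100 bins hold 1617 hits with --jaccard_clip and 3077 without, out of 3946 and 5688 hits overall.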
I would like to choose the better of these transcriptomes for my analysis, but I'm still not sure which is the more representative. Does anyone have advice on how to make the final selection?
Hey Chris, thanks!
I tried BUSCO on the longest isoform per gene from both assemblies but didn't get great numbers for either...
With --jaccard_clip
Without --jaccard_clip
It's a transcriptome, so you may not get the complete set of BUSCOs for your taxonomic group (this only represents what is expressed, unlike a genome).
The key is to use this to compare assembly versions (or assemblies made with different tools). The two are fairly comparable, but the --jaccard_clip assembly scores slightly higher. It might be better to run BUSCO on all the transcripts (not just the longest isoform per gene) in 'transcriptome' mode, if you aren't already doing that; the longest representative sequence may not always be the best one.
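If you end up comparing several BUSCO runs, a small script can make the side-by-side comparison less error-prone. The Python sketch below assumes the one-line summary notation that BUSCO writes in its short summary file, of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:..; the function names are hypothetical, and no real scores are filled in here.

```python
import re

# Assumed format of the one-line BUSCO summary, e.g.
#   C:<pct>%[S:<pct>%,D:<pct>%],F:<pct>%,M:<pct>%,n:<count>
SUMMARY_RE = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,"
    r"n:(?P<total>\d+)"
)

PERCENT_KEYS = ("complete", "single", "duplicated", "fragmented", "missing")


def parse_busco_summary(line):
    """Turn one BUSCO summary line into a dict of percentages plus n."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a recognised BUSCO summary line: {line!r}")
    scores = {key: float(match.group(key)) for key in PERCENT_KEYS}
    scores["total"] = int(match.group("total"))
    return scores


def score_differences(first, second):
    """Percentage-point difference (first minus second) for each category."""
    return {key: first[key] - second[key] for key in PERCENT_KEYS}
```

When BUSCO is run on all isoforms rather than one sequence per gene, expect the duplicated (D) fraction to go up, so the complete (C) and missing (M) values are the more useful ones to compare between assemblies.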