Question

Trinity transcriptome quality assessment

0

7.7 years ago

cwbenson1993 • 0

I have 8 replicates of Illumina paired-end reads from a fungal RNA-seq experiment that I have de novo assembled using Trinity. The Trinity documentation says to use the --jaccard_clip option if you expect high gene density, which may be the case for a small fungal genome.

I assembled the transcriptome twice, once with --jaccard_clip and once without, and performed a couple of the recommended quality assessment steps for each.

Read representation is good for both transcriptomes: 99.12% of reads map to the assembly made with --jaccard_clip, and 98.97% map to the assembly made without it.

Below are the stats for each transcriptome, generated using TrinityStats.pl.

With --jaccard_clip

Counts of transcripts, etc.
Total trinity 'genes': 18674
Total trinity transcripts: 30205
Percent GC: 62.50

Stats based on ALL transcript contigs:
Contig N50: 2481
Median contig length: 674
Average contig: 1296.03
Total assembled bases: 39146732


Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 2336
Median contig length: 363
Average contig: 1034.59
Total assembled bases: 19319873

Without --jaccard_clip

Counts of transcripts, etc.
Total trinity 'genes': 6106
Total trinity transcripts: 18773
Percent GC: 62.38

Stats based on ALL transcript contigs:
Contig N50: 4196
Median contig length: 2074
Average contig: 2720.12
Total assembled bases: 51064844

Stats based on ONLY LONGEST ISOFORM per 'GENE':
Contig N50: 3986
Median contig length: 1973.5
Average contig: 2514.36
Total assembled bases: 15352683
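
For anyone comparing the N50 values above, here is a minimal sketch of how a contig N50 is usually defined: the contig length at which contigs of that length or longer hold at least half of all assembled bases. The function name `n50` and the toy lengths are illustrative only; this is not the code TrinityStats.pl runs.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is taken here as the length L such that contigs of length >= L
    together contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example (made-up lengths, not from either assembly).
toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
print("N50 of toy contigs:", n50(toy_lengths))
```

Because the statistic is weighted toward long contigs, transcripts from neighbouring genes that get fused into one contig can raise it, so a higher N50 on its own does not show that an assembly is better.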

The Trinity documentation also recommends counting full-length transcripts by BLAST against SwissProt. Below, transcripts were aligned to their best protein hit; the tables give the counts in each percent-coverage bin. For more info on these tables, see https://github.com/trinityrnaseq/trinityrnaseq/wiki/Counting-Full-Length-Trinity-Transcripts

With --jaccard_clip

hit_pct_cov_bin   count_in_bin   >bin_below
100               1046           1046
90                571            1617
80                404            2021
70                331            2352
60                327            2679
50                330            3009
40                305            3314
30                231            3545
20                281            3826
10                120            3946

Without --jaccard_clip

hit_pct_cov_bin   count_in_bin   >bin_below
100               2313           2313
90                764            3077
80                552            3629
70                440            4069
60                381            4450
50                333            4783
40                314            5097
30                269            5366
20                220            5586
10                102            5688
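
One way to put the two coverage tables side by side is a small script like the sketch below. It parses the rows as posted, checks that the `>bin_below` column is the running total of `count_in_bin`, and prints the cumulative counts at a chosen bin. The helper names (`parse_bins`, `check_cumulative`) are made up for this example. My reading of the Trinity wiki is that the cumulative value at the 90 bin counts hits covered over more than 80% of their length, but treat that interpretation as an assumption.

```python
WITH_CLIP = """
hit_pct_cov_bin count_in_bin >bin_below
100 1046 1046
90 571 1617
80 404 2021
70 331 2352
60 327 2679
50 330 3009
40 305 3314
30 231 3545
20 281 3826
10 120 3946
"""

WITHOUT_CLIP = """
hit_pct_cov_bin count_in_bin >bin_below
100 2313 2313
90 764 3077
80 552 3629
70 440 4069
60 381 4450
50 333 4783
40 314 5097
30 269 5366
20 220 5586
10 102 5688
"""


def parse_bins(text):
    """Parse table rows into (bin, count, cumulative) integer tuples."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header or anything that is not a three-column data row.
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        rows.append(tuple(int(f) for f in fields))
    return rows


def check_cumulative(rows):
    """Return True if the third column is the running sum of the second."""
    running = 0
    for _bin, count, cumulative in rows:
        running += count
        if running != cumulative:
            return False
    return True


with_rows = parse_bins(WITH_CLIP)
without_rows = parse_bins(WITHOUT_CLIP)

# Cumulative counts keyed by bin, for a direct comparison.
with_cum = {b: cum for b, _count, cum in with_rows}
without_cum = {b: cum for b, _count, cum in without_rows}

for label, rows in (("with", with_rows), ("without", without_rows)):
    print(label, "--jaccard_clip: cumulative column consistent =", check_cumulative(rows))

print("cumulative at bin 90: with =", with_cum[90], "| without =", without_cum[90])
```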

I would like to choose the better of these transcriptomes for my analysis, but I'm still not sure which is more representative. Does anyone have advice about how to make the final selection?

Assembly RNA-Seq • 4.0k views

ADD COMMENT • link updated 7.7 years ago by Chris Fields ★ 2.2k • written 7.7 years ago by cwbenson1993 • 0

score 0 · Answer 1 · 2018-02-19

0

7.7 years ago

Chris Fields ★ 2.2k

cwbenson1993, have you tried any of the other assessment recommendations from the Trinity docs?

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment

I recommend running BUSCO for both, but the others are well worth checking into (TransRate as well).

ADD COMMENT • link 7.7 years ago by Chris Fields ★ 2.2k

0

Hey Chris, thanks!

I ran BUSCO on the longest isoform per gene from both assemblies but didn't get great numbers for either...

With --jaccard_clip

 C:74.4%[S:73.7%,D:0.7%],F:19.1%,M:6.5%,n:1335

Without --jaccard_clip

  C:73.3%[S:73.0%,D:0.3%],F:15.6%,M:11.1%,n:1335
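
To compare the two BUSCO summaries field by field, a short parser along these lines may help. It is only a sketch: the regular expression assumes the one-line summary format exactly as pasted above, and `parse_busco` is a hypothetical helper, not part of BUSCO.

```python
import re

# Matches a one-line BUSCO summary such as
# C:74.4%[S:73.7%,D:0.7%],F:19.1%,M:6.5%,n:1335
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Return the percentages (floats) and n (int) from a summary line."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError("not a recognised BUSCO summary line: %r" % summary)
    fields = {k: float(v) for k, v in match.groupdict().items() if k != "n"}
    fields["n"] = int(match.group("n"))
    return fields


with_clip = parse_busco("C:74.4%[S:73.7%,D:0.7%],F:19.1%,M:6.5%,n:1335")
without_clip = parse_busco("C:73.3%[S:73.0%,D:0.3%],F:15.6%,M:11.1%,n:1335")

# Difference (with minus without) for each category, in percentage points.
for key in ("C", "S", "D", "F", "M"):
    diff = with_clip[key] - without_clip[key]
    print(f"{key}: with={with_clip[key]:.1f} without={without_clip[key]:.1f} diff={diff:+.1f}")
```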

ADD REPLY • link 7.7 years ago by cwbenson1993 • 0

0

It's a transcriptome, so you may not get the complete set of BUSCOs for your taxonomic group (this only represents what is expressed, unlike a genome).

The key is to use this to compare assembly versions (or assemblies made with different tools). The two are fairly comparable, but the --jaccard_clip assembly scores slightly higher on completeness. It might be better to run BUSCO on all the transcripts (not just the longest isoform per gene) using the 'transcriptome' mode, if you aren't already doing that; the longest representative sequence may not always be the best.

ADD REPLY • link 7.7 years ago by Chris Fields ★ 2.2k