Trinity predicting more genes than expected?
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    35868
Total trinity transcripts:    54969
Percent GC: 51.52

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 9567
Contig N20: 7769
Contig N30: 6524
Contig N40: 5393
Contig N50: 4511

Median contig length: 1780
Average contig: 2555.95
Total assembled bases: 140497949

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 7964
Contig N20: 6149
Contig N30: 5018
Contig N40: 4126
Contig N50: 3411

Median contig length: 1077
Average contig: 1843.53
Total assembled bases: 66123622


This is what TrinityStats.pl gives me after the assembly.

I was expecting about 12,510 genes, but Trinity reports 35,868. Even after I remove the isoforms, it still gives me 25,747 genes. Why am I getting roughly 13k extra genes?

Has anyone else run into this problem with Trinity?

I got the same result from Trinity. I extracted the longest sequence per gene from the output and used those sequences for downstream analysis.

How did you isolate the longest sequence?

Trinity author Brian Haas has provided a Perl script to extract the longest isoforms from Trinity assemblies, along with this comment:

The longest transcript isn't always the 'best' transcript.... but this has been asked for so many times, I'll just write the script and post it shortly.
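For anyone who wants to see what "longest isoform per gene" means in practice, here is a minimal Python sketch. It is not the Perl script mentioned above. It assumes the usual Trinity header layout, where the transcript ID ends in an isoform suffix such as `_i1` and the gene ID is everything before that suffix; the function names are my own, and the example file name in the usage note is hypothetical.

```python
import re

# Assumption: Trinity transcript IDs look like "TRINITY_DN1000_c115_g5_i1",
# so stripping the trailing "_i<number>" yields the Trinity 'gene' ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(handle):
    """Return {gene_id: (header, sequence)} keeping the longest isoform per gene."""
    best = {}
    for header, seq in read_fasta(handle):
        # The transcript ID is the first whitespace-delimited token of the header.
        transcript_id = (header.split() or [""])[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best
```

You would call it with something like `longest_isoform_per_gene(open("Trinity.fasta"))` and write the surviving records back out as FASTA. As the comment above says, the longest transcript is not necessarily the best one, so treat this as a convenience filter rather than a quality filter.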

Initially, I thought the estimate was this high because I had not used the "--trimmomatic" or "--normalize_reads" parameters, but when I ran it again I got even more Trinity transcripts. I think I will run the analysis on both the longest transcripts and all of them. Thank you.

We used a custom Perl script. As mentioned in the Trinity Frequently Asked Questions, you can also use all transcripts for your downstream analysis. That is reasonable too.

I think that is quite normal. Most or all transcriptome assemblies will largely overestimate the number of transcripts, because gaps in read coverage split a single transcript into several contigs. A factor of 2-3 is quite good, I think. Why don't you map the reads to the genome instead and check for novel transcripts that way?
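To put the numbers from the question next to that 2-3x rule of thumb, here is a small Python sketch. The three counts are the ones quoted in this thread; the function name is just for illustration.

```python
def overestimation_factor(assembled_count, expected_count):
    """Return how many times larger the assembled count is than the expected count."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    return assembled_count / expected_count


# Counts taken from the question in this thread.
EXPECTED_GENES = 12510
TRINITY_GENES = 35868          # Trinity 'genes' reported by the stats script
AFTER_ISOFORM_REMOVAL = 25747  # count the poster reports after removing isoforms

factor_all = overestimation_factor(TRINITY_GENES, EXPECTED_GENES)
factor_filtered = overestimation_factor(AFTER_ISOFORM_REMOVAL, EXPECTED_GENES)
```

Both ratios fall between 2 and 3, so the assembly in the question sits inside the range described above rather than being an outlier.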

If I run a genome-guided Trinity assembly, how would it give me novel transcripts?

Why does it overpredict? How can I explain it?

For the assembly I used 3 biological replicates, and I got about 3 times the known number of genes, which made me wonder whether it was really assembling the reads.

Okay, I now understand why it is overestimating and how I can remove similar clusters.

While assembling, I combined the control and inoculated samples, which should have been assembled separately. Also, there is a program called CD-HIT that clusters similar sequences and keeps one representative per cluster, which gives a less redundant assembly.
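For reference, a minimal Python sketch of how CD-HIT's nucleotide mode (`cd-hit-est`) could be called on a Trinity assembly. The 0.95 identity threshold, the matching word length of 10, and the file names are illustrative assumptions, not values from this thread; the wrapper function names are my own, and `cd-hit-est` has to be installed and on your PATH for the run step to work.

```python
import subprocess


def build_cd_hit_est_command(input_fasta, output_fasta,
                             identity=0.95, word_length=10, threads=1):
    """Build the cd-hit-est argument list without running it.

    identity    -> -c, sequence identity threshold (0.95 is an assumed example)
    word_length -> -n, word size (10 is the documented choice for 0.95-1.0)
    threads     -> -T, number of threads
    """
    if not 0.0 < identity <= 1.0:
        raise ValueError("identity must be in the interval (0, 1]")
    return [
        "cd-hit-est",
        "-i", input_fasta,
        "-o", output_fasta,
        "-c", str(identity),
        "-n", str(word_length),
        "-T", str(threads),
    ]


def run_cd_hit_est(input_fasta, output_fasta, **options):
    """Run cd-hit-est and raise CalledProcessError if it exits with an error."""
    command = build_cd_hit_est_command(input_fasta, output_fasta, **options)
    subprocess.run(command, check=True)
    return command
```

A call such as `run_cd_hit_est("Trinity.fasta", "Trinity.cdhit.fasta")` would write the representative sequences to the output file. The threshold you pick directly changes how many "genes" remain, so it is worth reporting it alongside the final count.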