Question

Trinity predicting more genes than expected?

8.3 years ago

kanika.151 ▴ 130

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    35868
Total trinity transcripts:    54969
Percent GC: 51.52

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 9567
    Contig N20: 7769
    Contig N30: 6524
    Contig N40: 5393
    Contig N50: 4511

    Median contig length: 1780
    Average contig: 2555.95
    Total assembled bases: 140497949


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 7964
    Contig N20: 6149
    Contig N30: 5018
    Contig N40: 4126
    Contig N50: 3411

    Median contig length: 1077
    Average contig: 1843.53
    Total assembled bases: 66123622

This is what TrinityStats.pl gives me after the assembly.

I was expecting a total of 12,510 genes, but Trinity reports 35,868. Even after removing the isoforms, it still gives me 25,747 genes. Why is it giving me about 13k extra genes?

Has anyone else stumbled on this trinity problem?

RNA-Seq genes trinity • 3.3k views

ADD COMMENT • link updated 21 months ago by Ram 43k • written 8.3 years ago by kanika.151 ▴ 130

I got the same result with Trinity. I isolated the longest sequence for each gene from the output and then used those for downstream analysis.

ADD REPLY • link 8.3 years ago by Mehmet ▴ 820

How did you isolate the longest sequence?

ADD REPLY • link 8.3 years ago by kanika.151 ▴ 130

Trinity author Brian Haas has provided a Perl script to extract the longest isoforms from Trinity assemblies, along with this comment:

The longest transcript isn't always the 'best' transcript.... but this has been asked for so many times, I'll just write the script and post it shortly.
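As a rough illustration of what "longest isoform per gene" means, here is a minimal Python sketch. It is not the script mentioned above. It assumes the usual Trinity header style (for example `>TRINITY_DN1_c0_g1_i2 len=10 ...`), where the gene ID is the transcript ID with the trailing `_i<number>` removed; the function names and demo sequences are made up for illustration.

```python
import re

# Assumption: transcript IDs end in "_i<number>", and removing that suffix
# gives the Trinity 'gene' ID (e.g. TRINITY_DN1_c0_g1_i2 -> TRINITY_DN1_c0_g1).
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Return {gene_id: (header, sequence)}, keeping the longest isoform."""
    best = {}
    for header, seq in records:
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        # Keep the first record seen for a gene, replace it only if longer.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


if __name__ == "__main__":
    # Made-up demo records, not real assembly output.
    demo = [
        ">TRINITY_DN1_c0_g1_i1 len=6",
        "ACGTAC",
        ">TRINITY_DN1_c0_g1_i2 len=10",
        "ACGTACGTAC",
        ">TRINITY_DN2_c0_g1_i1 len=4",
        "ACGT",
    ]
    for gene, (header, seq) in longest_isoform_per_gene(read_fasta(demo)).items():
        print(gene, header.split()[0], len(seq))
```

For a real assembly you would pass an open file handle to `read_fasta` instead of the demo list. As the quoted comment warns, the longest isoform is a convenience choice, not necessarily the best representative.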

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by h.mon 35k

Initially, I thought I was getting such a high estimate because I had not used the "--trimmomatic" or "--normalize_reads" parameters, but when I ran it again I got even more Trinity transcripts. I think I will run the analysis on both the longest transcripts and all of them. Thank you.

ADD REPLY • link 8.3 years ago by kanika.151 ▴ 130

We used a custom Perl script. As mentioned in the Trinity Frequently Asked Questions, you can also use all transcripts for your downstream analysis. That is reasonable too.

ADD REPLY • link 8.3 years ago by Mehmet ▴ 820

Hello kanika.151!

Questions similar to yours can already be found at:

Trinity assmbly result check

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

Re-opened because it wasn't exactly identical.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by Michael 54k

I think that is quite normal. Most or all transcriptome assemblies will largely overestimate the number of transcripts because of gaps. A factor of 2-3 is quite good, I think. Why don't you map the reads to the genome instead and check for novel transcripts that way?
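To put the "factor of 2-3" next to the numbers in the question, here is a small Python calculation. It uses only the counts quoted in the thread; the variable and function names are illustrative.

```python
# Counts taken from the question and the stats output above.
expected_genes = 12510        # number of genes the poster expected
trinity_genes = 35868         # "Total trinity 'genes'"
trinity_transcripts = 54969   # "Total trinity transcripts"
genes_after_filtering = 25747  # count quoted after removing isoforms


def fold_over_expected(observed, expected):
    """Ratio of an observed count to the expected count."""
    return observed / expected


gene_fold = fold_over_expected(trinity_genes, expected_genes)
filtered_fold = fold_over_expected(genes_after_filtering, expected_genes)
isoforms_per_gene = trinity_transcripts / trinity_genes

print(f"Trinity 'genes' vs expected:       {gene_fold:.2f}x")
print(f"After removing isoforms:           {filtered_fold:.2f}x")
print(f"Average transcripts per 'gene':    {isoforms_per_gene:.2f}")
print(f"Extra genes after filtering:       {genes_after_filtering - expected_genes}")
```

Both ratios fall between 2 and 3, so by the rule of thumb above this assembly is within the range to be expected.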

ADD REPLY • link 8.3 years ago by Michael 54k

If I do a genome-based Trinity assembly, how would it give me novel transcripts?

ADD REPLY • link 8.3 years ago by kanika.151 ▴ 130

Why does it overpredict? How can I explain it?

For the assembly I used 3 biological replicates (so 3 times the input), and I got about 3 times the number of known genes, which made me wonder: was it really assembling the reads?

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by kanika.151 ▴ 130

Ram · Answer 1 · 2016-01-04

8.3 years ago

kanika.151 ▴ 130

Okay, I now understand why it is over-estimating and how I can remove similar clusters.

During assembly I combined the control and inoculated samples, which should have been assembled separately. Also, there is a tool called CD-HIT that clusters similar sequences and removes the redundant ones, giving a reduced assembly.
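For nucleotide sequences, the CD-HIT suite provides `cd-hit-est`. Below is a hedged Python sketch that only builds (and, if the program is installed and the input exists, runs) a typical command. The file names and the 95% identity threshold are placeholders rather than values from this thread; choose a threshold that suits your data.

```python
import os
import shutil
import subprocess


def build_cd_hit_est_command(in_fasta, out_fasta, identity=0.95,
                             word_size=10, threads=1):
    """Build a cd-hit-est argument list.

    -i / -o : input and output FASTA
    -c      : sequence identity threshold
    -n      : word size (10 is a usual choice for thresholds of 0.95 and up)
    -T      : number of threads
    """
    return [
        "cd-hit-est",
        "-i", in_fasta,
        "-o", out_fasta,
        "-c", str(identity),
        "-n", str(word_size),
        "-T", str(threads),
    ]


if __name__ == "__main__":
    # Placeholder file names, assumed for illustration only.
    cmd = build_cd_hit_est_command("Trinity.fasta", "Trinity.cdhit95.fasta")
    print(" ".join(cmd))
    # Only run when the tool is on PATH and the input file is present.
    if shutil.which("cd-hit-est") and os.path.exists("Trinity.fasta"):
        subprocess.run(cmd, check=True)
```

Clustering at high identity mainly collapses near-duplicate contigs, so the gene count may still end up above the number of annotated genes.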

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.3 years ago by kanika.151 ▴ 130