3 million de novo mammalian RNA contigs. Too many?
6.1 years ago
vlad1 • 0

Has anyone seen a similar number of contigs? Would you consider them real splice variants or assembly errors? The Trinity assembly was built from HiSeq 2x150 paired-end reads from ~110 mammalian brain samples: ~6.3 billion read pairs in total (1,907,297 Mbases). I can't say how many reads were discarded after trimming, but the mean read quality doesn't look unusual: 90.88% of bases >= Q30, quality score 37.96. Trinity parameters were default, i.e. they included "insilico_read_normalization.pl --max_cov 50". Here are the Trinity assembly metrics:

n_seqs  3236542
smallest    201
largest 20360
n_bases 2340786001
mean_len    723.23671
n_under_200 0
n_over_1k   642187
n_over_10k  609
n_with_orf  222378
mean_orf_percent    34.14792
n90 291
n70 594
n50 1136
n30 1954
n10 3685
gc  0.45103
bases_n 0
proportion_n    0
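
For readers unfamiliar with the n10 to n90 rows: under the usual definition, nX is the length L such that contigs of length >= L together hold at least X% of all assembled bases. A minimal sketch of that calculation follows, assuming one contig length per line on standard input. The function name `n_stat` is made up for illustration and is not part of Trinity or of the tool that produced the table above.

```shell
# Hypothetical helper: print the N<pct> of a list of contig lengths.
# Usage: n_stat 50 < lengths.txt   (one integer length per line)
n_stat() {
  sort -rn | awk -v pct="$1" '
    { len[NR] = $1; total += $1 }      # store lengths (longest first), sum bases
    END {
      target = total * pct / 100       # bases that must be covered
      for (i = 1; i <= NR; i++) {
        run += len[i]                  # running total over the longest contigs
        if (run >= target) { print len[i]; exit }
      }
    }'
}
```

With n50 at 1136 and n90 at 291, half of the assembled bases sit in contigs of at least 1136 bp, while the last 10% sit in contigs shorter than about 291 bp.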

Thanks, Vlad

RNA-Seq Assembly next-gen • 1.2k views

This is Trinity FAQ #1; it has been asked time and time again. That said, 3 million contigs really is a lot, well beyond the "a lot" I usually see, which is in the range of 100-500 thousand. I've found the ExN50 statistic really useful, particularly this part:

If you want to know how many transcripts correspond to the Ex 90 peak, you could:

cat transcripts.TMM.EXPR.matrix.E-inputs |  egrep -v ^\# | awk '$1 <= 90' | wc -l
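
The same count can be wrapped in a small function so other Ex thresholds are easy to try. This is only a sketch: it assumes, as the `$1 <= 90` filter above implies, that the first column of the E-inputs file holds the cumulative Ex percentage and that header lines start with `#`. The name `count_ex_transcripts` is hypothetical and not part of Trinity.

```shell
# Hypothetical helper: count rows whose first column (assumed to be the
# Ex percentage) is at or below a threshold, skipping '#' header lines.
# Usage: count_ex_transcripts FILE [THRESHOLD]   (threshold defaults to 90)
count_ex_transcripts() {
  file="$1"
  ex="${2:-90}"
  awk -v ex="$ex" '!/^#/ && NF > 0 && $1 <= ex { n++ } END { print n + 0 }' "$file"
}
```

For example, `count_ex_transcripts transcripts.TMM.EXPR.matrix.E-inputs 90` should give the same number as the one-liner above, without the padding that `wc -l` adds on some systems.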
  
6.1 years ago

Map the contigs to the genome with GMAP (or, more recently, minimap2); for GMAP, choose GFF3 output. Then visualize the alignments in your favourite genome browser. I am sure there is a lot of absolute rubbish in there, particularly partial transcripts, so compare locus by locus against, for example, the GENCODE transcript sets.
