Question

3 million de novo mammalian RNA contigs. Too many?

6.1 years ago

vlad1 • 0

Has anybody seen a similar number of contigs? Would you consider them real splicing variants or assembly errors? The Trinity assembly was built from HiSeq 2x150 paired-end reads from ~110 mammalian brain samples: ~6.3 billion read pairs in total, 1,907,297 Mbases. I can't say how many reads were discarded after trimming, but the mean read quality doesn't seem unusual: % of >= Q30 Bases: 90.88; Quality Score: 37.96. Trinity parameters were the defaults, i.e. they included "insilico_read_normalization.pl --max_cov 50". Here are the Trinity assembly metrics:

n_seqs  3236542
smallest    201
largest 20360
n_bases 2340786001
mean_len    723.23671
n_under_200 0
n_over_1k   642187
n_over_10k  609
n_with_orf  222378
mean_orf_percent    34.14792
n90 291
n70 594
n50 1136
n30 1954
n10 3685
gc  0.45103
bases_n 0
proportion_n    0
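
For readers who want to recompute length statistics like the n50/n90 values above, here is a minimal sketch. It assumes the usual definition of Nx (the length of the contig at which the running total of lengths, sorted longest first, reaches x% of all assembled bases). The function names and the file name `Trinity.fasta` are placeholders, and the tool that produced the table above may differ in details.

```shell
# Print one sequence length per line for a (possibly multi-line) FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print n; seen = 1; n = 0; next }
         { n += length($0) }
         END { if (seen) print n }' "$1"
}

# nx X: read lengths on stdin, print the Nx length (X=50 gives N50).
nx() {
    sort -rn | awk -v x="$1" '
        { len[NR] = $1; total += $1 }
        END {
            # Walk contigs from longest to shortest until x% of bases is covered.
            for (i = 1; i <= NR; i++) {
                cum += len[i]
                if (cum * 100 >= total * x) { print len[i]; exit }
            }
        }'
}

# Usage (file name is a placeholder):
#   fasta_lengths Trinity.fasta | nx 50
```

A low N50 alongside a very high contig count, as here, usually means most of the sequences are short fragments.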

Thanks, Vlad

RNA-Seq Assembly next-gen • 1.2k views

updated 6.1 years ago by GenoMax 141k • written 6.1 years ago by vlad1 • 0

This is Trinity FAQ #1; it has been asked time and time again. That said, 3 million contigs really is a lot, far more than the "a lot" I have usually observed, which is in the range of 100-500 thousand. I've found the ExN50 statistic really useful, particularly this part:

If you want to know how many transcripts correspond to the Ex 90 peak, you could run:
cat transcripts.TMM.EXPR.matrix.E-inputs |  egrep -v ^\# | awk '$1 <= 90' | wc -l
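
To see how the contig count grows across several Ex thresholds rather than at Ex 90 only, the same filter can be wrapped in a small loop. This is only a sketch under the same assumption as the command above (the first column holds the Ex value and comment lines start with `#`); the function name `count_by_ex` is made up for illustration.

```shell
# count_by_ex FILE EX...: for each threshold, count non-empty, non-comment
# rows whose first column is <= that threshold.
count_by_ex() {
    file="$1"; shift
    for ex in "$@"; do
        awk -v ex="$ex" '
            NF && !/^#/ && $1 <= ex { n++ }
            END { printf "E%s\t%d\n", ex, n }' "$file"
    done
}

# Usage:
#   count_by_ex transcripts.TMM.EXPR.matrix.E-inputs 50 80 90 95 100
```

If the count at Ex 90 is a small fraction of the 3 million total, most contigs are contributing very little of the expression.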
  

6.1 years ago by h.mon 35k

score 1 · Answer 1 · 2018-03-15

6.1 years ago

colindaven 6.3k

Map the contigs to the genome with GMAP (or, more recently, minimap2); for GMAP, choose GFF3 output. Then visualize the alignments in your favourite genome browser. I am sure there is a lot of absolute rubbish in there, particularly partial transcripts, so compare locus by locus against, for example, the Gencode transcript sets.
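
As a rough sketch of the mapping step, the two functions below wrap typical invocations. The function names, file names, the `gmap_index` directory, the `genome_db` database name, and the thread counts are all placeholders. The options shown (`-ax splice` for minimap2; `gmap_build` followed by `gmap -f gff3_gene`) are common choices rather than settings taken from this thread, so check them against the version you have installed.

```shell
# Spliced alignment of contigs to a genome with minimap2 (SAM output).
# Args: GENOME_FASTA CONTIGS_FASTA OUT_SAM [THREADS]
map_with_minimap2() {
    minimap2 -ax splice -t "${4:-4}" "$1" "$2" > "$3"
}

# Build a GMAP index, then align contigs and write gene-style GFF3.
# Args: GENOME_FASTA CONTIGS_FASTA OUT_GFF3 [THREADS]
# "gmap_index" and "genome_db" are placeholder directory/database names.
# -n 1 asks GMAP to report only the best alignment path per contig.
map_with_gmap() {
    gmap_build -D gmap_index -d genome_db "$1" &&
    gmap -D gmap_index -d genome_db -t "${4:-4}" -n 1 -f gff3_gene "$2" > "$3"
}

# Usage (placeholder file names):
#   map_with_minimap2 genome.fa Trinity.fasta contigs.sam 8
#   map_with_gmap     genome.fa Trinity.fasta contigs.gff3 8
```

The GFF3 from GMAP can be loaded into a genome browser directly, while the SAM output from minimap2 would typically need to be sorted and converted to BAM first (for example with samtools).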
