Hi guys,
We have RNA-seq data from an insect, sequenced in 2012, which we assembled at the time with one of the 2011 Trinity releases (producing trinity.fasta). I have now analyzed the sequence length distribution in this file and got the following result:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Downloads/gene.fa
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A C G T N IUPAC Other GC GC_stdev
0.2875 0.2118 0.2067 0.2940 0.0000 0.0000 0.0000 0.4186 0.0894
Main genome scaffold total:          144777
Main genome contig total:            144777
Main genome scaffold sequence total: 67.067 MB
Main genome contig sequence total:   67.067 MB   0.000% gap
Main genome scaffold N/L50:          15033/1.075 KB
Main genome contig N/L50:            15033/1.075 KB
Max scaffold length:                 24.081 KB
Max contig length:                   24.081 KB
Number of scaffolds > 50 KB:         0
% main genome in scaffolds > 50 KB:  0.00%
Minimum  Number         Number         Total          Total          Scaffold
Scaffold of             of             Scaffold       Contig         Contig 
Length   Scaffolds      Contigs        Length         Length         Coverage
-------- -------------- -------------- -------------- -------------- --------
    All         144,777        144,777     67,066,997     67,066,997  100.00%
    100         144,777        144,777     67,066,997     67,066,997  100.00%
    250          56,929         56,929     53,670,774     53,670,774  100.00%
    500          30,137         30,137     44,518,044     44,518,044  100.00%
   1 KB          16,207         16,207     34,757,505     34,757,505  100.00%
2.5 KB           4,183          4,183     15,894,549     15,894,549  100.00%
   5 KB             588            588      3,942,668      3,942,668  100.00%
  10 KB              28             28        353,549        353,549  100.00%
In this file the minimum sequence length is 101; the longest one is 22181.
Over the past several days I used the latest Trinity version, trinityrnaseq-2.0.6, to assemble the same raw data again (after trimming low-quality reads, of course). This time the length distribution of the file is as follows:
kurban@kurban-X550VC:~/Downloads/bbmap$ sh stats.sh in=~/Desktop/data_from_server/2015_6_04_assembled_CD_and_CK/Trinity.fasta
stats.sh: 52: stats.sh: Bad substitution
stats.sh: 59: stats.sh: [[: not found
stats.sh: 59: stats.sh: [[: not found
stats.sh: 65: stats.sh: source: not found
stats.sh: 66: stats.sh: parseXmx: not found
A C G T N IUPAC Other GC GC_stdev
0.2932 0.2083 0.2114 0.2871 0.0000 0.0000 0.0000 0.4197 0.0823
Main genome scaffold total:          56130
Main genome contig total:            56130
Main genome scaffold sequence total: 57.963 MB
Main genome contig sequence total:   57.963 MB   0.000% gap
Main genome scaffold N/L50:          9036/1.861 KB
Main genome contig N/L50:            9036/1.861 KB
Max scaffold length:                 30.733 KB
Max contig length:                   30.733 KB
Number of scaffolds > 50 KB:         0
% main genome in scaffolds > 50 KB:  0.00%
Minimum  Number         Number         Total          Total          Scaffold
Scaffold of             of             Scaffold       Contig         Contig 
Length   Scaffolds      Contigs        Length         Length         Coverage
-------- -------------- -------------- -------------- -------------- --------
    All          56,130         56,130     57,962,594     57,962,594  100.00%
    100          56,130         56,130     57,962,594     57,962,594  100.00%
    250          50,921         50,921     56,731,956     56,731,956  100.00%
    500          29,025         29,025     49,248,962     49,248,962  100.00%
   1 KB          18,003         18,003     41,494,038     41,494,038  100.00%
2.5 KB           5,541          5,541     21,499,015     21,499,015  100.00%
In this second Trinity.fasta file the minimum sequence length is 224; the longest one is 30733.
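For anyone who wants to double-check the minimum and maximum lengths quoted above independently of stats.sh, here is a minimal Python sketch. It assumes a plain, uncompressed FASTA file; the function names (fasta_lengths, n50, summarize) are made up for illustration and are not part of BBMap or Trinity. N50 is taken here in the usual sense: the length such that sequences of that length or longer contain at least half of all bases.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length at which the longest sequences reach half the total bases."""
    total = sum(lengths)
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running * 2 >= total:
            return n
    return 0  # empty input


def summarize(path):
    """Return sequence count, min, max and N50 for the FASTA file at `path`."""
    with open(path) as handle:
        lengths = list(fasta_lengths(handle))
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "n50": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "n50": n50(lengths),
    }
```

Calling summarize() on each of the two FASTA files should give numbers that can be compared directly with the stats.sh tables.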
My questions are:
- Why are the two assembly results different? For example, the older version of Trinity assembled many sequences in the 101 to ~200 length range, but the minimum length of the sequences assembled with the latest version of Trinity is 224.
- Which trinity.fasta file should I use in the downstream analysis, and why?
Could you please give a somewhat detailed explanation?
Thanks