assembly through trinity
2
0
Entering edit mode
7.0 years ago

Hi

I am using trinity on galaxy.In output I received fasta file .kindly guide how to capture basic assembly stats.

Assembly galaxy • 2.0k views
ADD COMMENT
0
Entering edit mode

Hi, kindly be more specific and provide information on what you consider basic assembly stats.

ADD REPLY
0
Entering edit mode
7.0 years ago
st.ph.n ★ 2.7k

perl /path/to/Trinity/install/Trinity_stats.pl Trinity_output.fasta

or save this (from Trinity_stats.pl) to a file, and run with perl if you have it installed:

#!/usr/bin/env perl

use strict;
use warnings;

use FindBin;
use lib ("$FindBin::RealBin/../PerlLib");
use Fasta_reader;
use BHStats;

my $usage = "\n\nusage: $0 transcripts.fasta\n\n";

my $fasta_file = $ARGV[0] or die $usage;

main: {

    my $fasta_reader = new Fasta_reader($fasta_file);

    my @all_seq_lengths;

    my $number_transcripts = 0;

    my $num_GC = 0;

    my %component_to_longest_isoform;

    my $tot_seq_len = 0;

    while (my $seq_obj = $fasta_reader->next()) {

        my $acc = $seq_obj->get_accession();

        $number_transcripts++;
         my $comp_id;

        if ($acc =~ /^(.*c\d+_g\d+)/) {
            $comp_id = $1;
        }
        elsif ($acc =~ /^(.*comp\d+_c\d+)/) {
            $comp_id = $1;
        }
        else {
            die "Error, cannot decipher gene identifier from acc: $acc";
        }


        my $sequence = $seq_obj->get_sequence();

        my $seq_len = length($sequence);

        $tot_seq_len += $seq_len;

        if ( (! exists $component_to_longest_isoform{$comp_id})
             || 
             $component_to_longest_isoform{$comp_id} < $seq_len) {

            $component_to_longest_isoform{$comp_id} = $seq_len;
        }

        push (@all_seq_lengths, $seq_len);

        while ($sequence =~ /[gc]/ig) {
            $num_GC++;
        }

    }

    print "\n\n";
    print "################################\n";
    print "## Counts of transcripts, etc.\n";
    print "################################\n";

    print "Total trinity 'genes':\t" . scalar(keys %component_to_longest_isoform) . "\n";        
    print "Total trinity transcripts:\t" . $number_transcripts . "\n";


    my $pct_gc = sprintf("%.2f", $num_GC / $tot_seq_len * 100);


    print "Percent GC: $pct_gc\n\n"; 

    print "########################################\n";
    print "Stats based on ALL transcript contigs:\n";
    print "########################################\n\n";

    &report_stats(@all_seq_lengths);

    print "\n\n";
    print "#####################################################\n";
    print "## Stats based on ONLY LONGEST ISOFORM per 'GENE':\n";
    print "#####################################################\n\n";

    &report_stats(values %component_to_longest_isoform);


    print "\n\n\n";

    exit(0);
}

####
sub report_stats {
    my (@seq_lengths) = @_;

    @seq_lengths = reverse sort {$a<=>$b} @seq_lengths;

    my $cum_seq_len = 0;
    foreach my $len (@seq_lengths) {
        $cum_seq_len += $len;
    }

    for (my $i = 10; $i <= 50; $i += 10) {
        my $cum_len_needed = $cum_seq_len * $i/100;
        my $N_val = &get_contigNvalue($cum_len_needed, \@seq_lengths);
        print "\tContig N$i: $N_val\n";
    }
    print "\n";


    my $median_len = &BHStats::median(@seq_lengths);
    print "\tMedian contig length: $median_len\n";

    my $avg_len = sprintf("%.2f", &BHStats::avg(@seq_lengths));
    print "\tAverage contig: $avg_len\n";



    print "\tTotal assembled bases: $cum_seq_len\n";

    return;
}



sub get_contigNvalue {
    my ($cum_len_needed, $seq_lengths_aref) = @_;

    my $partial_sum_len = 0;
    foreach my $len (@$seq_lengths_aref) {
        $partial_sum_len += $len;

        if ($partial_sum_len >= $cum_len_needed) {
            return($len);
        }
    }


    return -1; # shouldn't happen.
}
ADD COMMENT
0
Entering edit mode

Does that work in Galaxy?

ADD REPLY
0
Entering edit mode
7.0 years ago

Hi.

How it was mentioned by st.ph.n the simplest way to get assembly stats is using the comand Trinity_stats.pl that come available with Trinity, but you need to instal Trinity in your computer.

I do not know if the Galaxy wrapper of Trinity has this available. The best way is ask at https://biostar.usegalaxy.org/

If you want, there exist many other software you can use to evaluate the assembled transcriptome. For instance, Transrate (http://hibberdlab.com/transrate/)

For a more detailed explanation in transcriptome assembly quality assesment you can consult https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome-Assembly-Quality-Assessment

Regards

ADD COMMENT

Login before adding your answer.

Traffic: 2934 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6