Question: Cleaning up Trinity assemblies
dpearton (South Africa) wrote, 4.5 years ago:

Hi,

I have performed de novo assemblies of mRNA using Trinity with paired-end Illumina data (125 bp PE). Trinity gives a large number of contigs (around 250,000), but many of these are very short (200-500 bp). In addition, there can be many versions of the same contig: different isoforms, splice variants or variant assemblies. Is there a way to "rationalise" or collapse the assembly, for example by keeping only the longest isoform of each Trinity 'gene'? I know that would possibly throw away information on splice variants, but that is not something I'm too interested in at the moment. And/or applying a size cut-off?

Assembly stats are as follows:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  185527
Total trinity transcripts:      252342
Percent GC: 40.60

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 4739
        Contig N20: 3372
        Contig N30: 2579
        Contig N40: 1972
        Contig N50: 1452

        Median contig length: 406
        Average contig: 803.80
        Total assembled bases: 202831907


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4338
        Contig N20: 2823
        Contig N30: 1957
        Contig N40: 1258
        Contig N50: 802

        Median contig length: 357
        Average contig: 621.99
        Total assembled bases: 115396193

 

Thanks,

Dave

seta (Sweden) wrote, 4.5 years ago:

If short contigs are not of interest, you can simply run Trinity with the flag --min_contig_length set to, for example, 400 or 500. Be careful, though: some proteins are short, so a strict cut-off may discard real transcripts. There is a script to extract the longest isoform, but the longest transcript is not always the best one, so you could instead filter out poorly supported transcripts using the RSEM output. Hope this helps.
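As an illustration of the two filters mentioned above (a size cut-off and longest isoform per gene), here is a minimal Python sketch that works on an already assembled FASTA file. It is not the script shipped with Trinity. It assumes Trinity-style headers where the isoform is marked by a trailing `_i<N>` (e.g. `TRINITY_DN1_c0_g1_i2`), and the function names `read_fasta`, `gene_id` and `longest_isoform_per_gene` are hypothetical. Note that the cut-off here is applied after assembly, which is not the same thing as passing --min_contig_length to Trinity itself.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header: str) -> str:
    """Assumed convention: the gene ID is the transcript ID minus its _i<N> suffix."""
    name = header.split()[0]
    return re.sub(r"_i\d+$", "", name)


def longest_isoform_per_gene(
    records: Iterable[Tuple[str, str]], min_length: int = 0
) -> Dict[str, Tuple[str, str]]:
    """Keep the longest isoform per gene, skipping sequences shorter than min_length."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        if len(seq) < min_length:
            continue
        gene = gene_id(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best
```

Usage would be something like `longest_isoform_per_gene(read_fasta(open("Trinity.fasta")), min_length=300)`, where the file name and the 300 bp threshold are only examples.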


Hi,

I know the longest might not be the "best", but I'm not sure what other criteria to use. I imagine that reads per contig, normalised for length, might be useful, but I have no idea how to implement that. I've visualised my assemblies in Tablet and there is a wide range of reads per contig.

I've not used RSEM - how would this work without a reference genome?

written 4.5 years ago by dpearton
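For reference, "reads per contig normalised for length" is essentially what FPKM measures: fragments per kilobase of contig per million mapped fragments. A minimal sketch of the arithmetic follows. The function name `fpkm` is hypothetical, and this is the simple textbook form; RSEM reports a refined value based on expected counts and effective length, so its numbers will not match this exactly.

```python
def fpkm(read_count: float, contig_length: int, total_mapped_reads: float) -> float:
    """Fragments per kilobase of contig per million mapped fragments.

    read_count:         fragments assigned to this contig
    contig_length:      contig length in bp
    total_mapped_reads: all mapped fragments in the library
    """
    if contig_length <= 0 or total_mapped_reads <= 0:
        raise ValueError("contig_length and total_mapped_reads must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million)
    return read_count * 1e9 / (contig_length * total_mapped_reads)


# 500 fragments on a 1,000 bp contig in a library of 10 million mapped fragments
print(fpkm(500, 1000, 10_000_000))  # 50.0
```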

I usually use --min_contig_length 300 to get rid of many short contigs; just add --min_contig_length 300 to your Trinity command. For RSEM, use the align_and_estimate_abundance.pl script in the Trinity package; then, using the RSEM output, you can filter out contigs with FPKM less than 1. Take a look at http://trinityrnaseq.sourceforge.net/analysis/abundance_estimation.html

written 4.5 years ago by seta
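On the reference-genome question: the abundance estimation step aligns the reads back to the assembled transcripts themselves, so no genome is needed. The FPKM filter can then be sketched in Python as below. This assumes the standard tab-separated RSEM isoform table with `transcript_id` and `FPKM` columns; the function names `transcripts_passing_fpkm` and `filter_fasta` are hypothetical, and the threshold of 1 is the one suggested above, not a universal rule.

```python
import csv
from typing import Iterable, Iterator, Set


def transcripts_passing_fpkm(rsem_lines: Iterable[str], min_fpkm: float = 1.0) -> Set[str]:
    """Return IDs of transcripts whose FPKM is at least min_fpkm.

    rsem_lines: lines of a tab-separated RSEM isoform results table
    (assumed to have a header row with 'transcript_id' and 'FPKM').
    """
    reader = csv.DictReader(rsem_lines, delimiter="\t")
    return {row["transcript_id"] for row in reader if float(row["FPKM"]) >= min_fpkm}


def filter_fasta(fasta_lines: Iterable[str], keep_ids: Set[str]) -> Iterator[str]:
    """Yield only the FASTA lines belonging to records whose ID is in keep_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            yield line
```

With real files, both functions can be given open file handles, for example the isoform results table written by the abundance estimation script and the Trinity FASTA file, and the yielded lines written to a new FASTA file.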