Question: Trinity assembly validation and statistics
giorgiocasaburi wrote, 3.9 years ago:

Hi All,

After one week, Trinity finally completed an assembly starting from 800 million reads (an entire NextSeq 500 run). The statistics look odd to me even though there are tons of sequences, and I would like your opinion:

################################
## Counts of transcripts, etc.
################################

Total trinity 'genes':    858807
Total trinity transcripts:    924905
Percent GC: 40.20

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 1739
    Contig N20: 769
    Contig N30: 490
    Contig N40: 382
    Contig N50: 324

    Median contig length: 270
    Average contig: 363.98
    Total assembled bases: 336649200

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 1123
    Contig N20: 575
    Contig N30: 421
    Contig N40: 349
    Contig N50: 304

    Median contig length: 268
    Average contig: 341.45
    Total assembled bases: 293239534

I used Trinity with default parameters plus --trimmomatic and --min_kmer_cov 2. I was really expecting the N50 to be bigger. What could be the reason for that?

Note: before starting the assembly I quality-filtered the sequences and merged the results into two big paired-end FASTA files.
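For reference, a minimal sketch of the run described above; the file names, core count and memory value are illustrative, and depending on the Trinity version the memory flag is spelled --max_memory or --JM:

    # Sketch of the invocation described in this question (names and resources are hypothetical).
    Trinity --seqType fq \
            --left  all_samples_R1.fastq \
            --right all_samples_R2.fastq \
            --trimmomatic \
            --min_kmer_cov 2 \
            --CPU 20 --max_memory 100G \
            --output trinity_out_dir

    # The statistics quoted in this thread come from Trinity's bundled script:
    $TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta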

 

Any advice would be precious!

Thanks!

Giorgio

 


That's a lot of transcripts. Try using the read normalization parameter. What species are you assembling?

written 3.9 years ago by st.ph.n
seta wrote, 3.9 years ago:

Hi, it sounds like your assembly is quite fragmented, which may have resulted from using --min_kmer_cov 2. Try --normalize_reads and keep --min_kmer_cov at its default. It took a long time to finish; could you please tell us what your system is (RAM and CPU)?
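A sketch of the adjusted command along these lines (same hypothetical inputs as above; --normalize_reads enabled in-silico normalization in Trinity releases from around this time, while newer versions normalize by default, and --min_kmer_cov stays at its default of 1 simply by omitting the flag):

    Trinity --seqType fq \
            --left  all_samples_R1.fastq \
            --right all_samples_R2.fastq \
            --trimmomatic \
            --normalize_reads \
            --CPU 20 --max_memory 100G \
            --output trinity_norm_out_dir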

giorgiocasaburi wrote, 3.9 years ago:

Thanks for your answer. I know it is a lot, and I did not run digital normalization; maybe I should have, considering that I have close to a billion raw reads. Do you think that might be the reason for such a low N50? Anyway, the species is a Hawaiian squid (no genome annotation at all, since it has tons of STRs), with a predicted genome size of about 3.8 Gb.

 

 

giorgiocasaburi wrote, 3.9 years ago:

Thanks, seta! I have access to an HPC at my university. For the analysis above I ran with 20 cores and 19 GB of RAM each, and it took a week in total. Also, I forgot to mention that the analysis was run on a single merged dataset combining several samples processed in the same run: 23 libraries in total (4 fastq files per lane, paired end, so 8 files per library), for an overall total of 184 fastq files (~1 GB each).

A) In your experience, could the fragmentation be caused by merging the samples?

B) Should I run a distinct assembly for every single library?

C) Or should I just stick with one merged dataset and run normalization with --min_kmer_cov 1 (the default), like you said?

Note: the libraries are from the same tissue but totally different treatments, if that matters at all.

 

Thanks!!! 

 


If your end goal is to look at the DEGs from your different treatments, then you want a single assembly with all samples done at the same time; individual assemblies will be difficult to analyse. When you say 'unique file', did you merge all your paired-end reads into one file? If so, that's why it took so long. I suggest merging all your forward reads into one file and all your reverse reads into a second, then using the --left and --right parameters to designate read orientation. Also, please reply in the comment section as opposed to posting an answer to your own question.

written 3.9 years ago by st.ph.n
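A minimal sketch of the merging step suggested here, with hypothetical library names; the libraries must appear in the same order in both merged files:

    # Merge forward and reverse reads separately.
    cat library_1_R1.fastq library_2_R1.fastq library_3_R1.fastq > all_R1.fastq
    cat library_1_R2.fastq library_2_R2.fastq library_3_R2.fastq > all_R2.fastq
    # Then point Trinity at the two merged files:
    # Trinity ... --left all_R1.fastq --right all_R2.fastq ...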

Sorry, by 'unique file' I meant one file merged from the R1 reads and another merged from the R2 reads. I did use --left and --right, obviously. The goal is indeed to compare the treatments, so I will stick with the pooling approach. I am currently running the assembly again using normalization and --min_kmer_cov 1. I actually didn't even use the merged files, but rather passed everything to Trinity as comma-separated lists (e.g. --left library_1_R1.fastq,library_2_R1.fastq,library_N_R1.fastq --right library_1_R2.fastq,library_2_R2.fastq,library_N_R2.fastq), although I don't think it makes any difference, as Trinity will merge everything anyway.

written 3.9 years ago by giorgiocasaburi
seta wrote, 3.9 years ago:

As st.ph.n said, if your goal is to assess DEGs, a single assembly is better, but you need a high-quality assembly for successful downstream analysis. More reads may cause high fragmentation in your assembly, so you could try assembling the individual libraries and then merging them with a tool like TGICL or EvidentialGene. Also, you could try a multi-k-mer assembler like SOAPdenovo or CLC and work with a higher k-mer to get rid of lots of transcripts. Your case is interesting; please come back to the post whenever you get the desired results with any pipeline.
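A rough sketch of the per-library route described here, under stated assumptions: library names and resources are hypothetical, and the final clustering step uses cd-hit-est purely as a stand-in for the merge that is suggested above with TGICL or EvidentialGene, each of which has its own pipeline:

    # Assemble each library separately (hypothetical names and resources).
    for lib in library_1 library_2 library_3; do
        Trinity --seqType fq \
                --left  ${lib}_R1.fastq --right ${lib}_R2.fastq \
                --normalize_reads \
                --CPU 20 --max_memory 100G \
                --output trinity_${lib}
    done

    # Pool the per-library assemblies, then reduce redundancy.
    # cd-hit-est is shown only as a placeholder for the TGICL/EvidentialGene merge.
    cat trinity_library_*/Trinity.fasta > pooled_transcripts.fasta
    cd-hit-est -i pooled_transcripts.fasta -o pooled_nr.fasta -c 0.95 -n 10 -T 20 -M 16000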


Hello! I'm not sure if this is a comment or not, but it's definitely not an answer. I have a few questions.

 

 

I have recently completed a Trinity assembly with about the same amount of data as the initial post, but for quail: just shy of 1.4 billion reads on a NextSeq 500. I ran with the default --min_kmer_cov and with --normalize_reads. Similarly, all R1 and R2 reads were each merged into a single file and passed using --left and --right. I did modify the --trimmomatic settings to be a bit more stringent, so we filtered out 15% of the reads and assembled the transcriptome with the remaining 85% of the data (normalised). The assembly took roughly the same time, 10 days, with about 20-30 failed commands that were re-run until the assembly was successful. The intention is much the same: run a differential expression analysis between the treated and control samples used in the assembly.
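As a hedged sketch of the kind of run described in this comment: the actual Trimmomatic settings used are not stated, so the values below are purely illustrative, the file names are hypothetical, and whether custom settings are passed via --quality_trimming_params may depend on the Trinity version.

    Trinity --seqType fq \
            --left  all_quail_R1.fastq --right all_quail_R2.fastq \
            --trimmomatic \
            --quality_trimming_params "SLIDINGWINDOW:4:20 LEADING:10 TRAILING:10 MINLEN:50" \
            --normalize_reads \
            --CPU 20 --max_memory 100G \
            --output trinity_quail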

My concern is the number of 'genes' being reported, 685,214 (see statistics below). Were you able to run any of the multi-k-mer assemblers suggested here?

We obtained fantastic overall mapping rates using TopHat2, about 88-95% of the reads from the individual libraries; the one problem we're observing is that 60-65% of the reads map to multiple locations (see the alignment summary further below). I'm assuming this is due to the number of isoforms per gene. Has anyone here been able to merge/combine the isoforms into a single gene model to avoid the redundancy? If so, could you share what you've done? It would be really nice to have a tool that constructs and merges the sequences and provides a GTF to run with TopHat. (One possible approach is sketched after the alignment summary below.)

Thank you in advance,

Rodrigo

 

Our assembly statistics are:

################################
## Counts of transcripts, etc.
################################

Total trinity 'genes':    685214
Total trinity transcripts:    924982
Percent GC: 42.37

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 5546
    Contig N20: 4009
    Contig N30: 3096
    Contig N40: 2421
    Contig N50: 1876

    Median contig length: 502
    Average contig: 1012.59
    Total assembled bases: 936624846

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 5128
    Contig N20: 3537
    Contig N30: 2588
    Contig N40: 1904
    Contig N50: 1386

    Median contig length: 435
    Average contig: 829.66
    Total assembled bases: 568491645


==> TopHat/.../align_summary.txt <==

Left reads:
          Input     :  32813854
           Mapped   :  30971280 (94.4% of input)
            of these:  19899801 (64.3%) have multiple alignments (131621 have >20)
Right reads:
          Input     :  32813854
           Mapped   :  30304829 (92.4% of input)
            of these:  19594664 (64.7%) have multiple alignments (131618 have >20)
93.4% overall read mapping rate.

 

written 3.6 years ago by rodrigo.gularte
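On the isoform-redundancy question above: Trinity ships a helper script that keeps only the longest isoform per 'gene' (the same reduction used for the second stats block), which is one way to build a less redundant reference before mapping. A minimal sketch only; the script's location can vary between Trinity versions, and the output name is illustrative:

    # Extract the longest isoform per Trinity 'gene' (script path may differ by version).
    $TRINITY_HOME/util/misc/get_longest_isoform_seq_per_trinity_gene.pl \
        Trinity.fasta > Trinity.longest_isoform.fasta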