Hello everyone,
I just assembled a data set from a non-model organism with Trinity, but I am getting many contigs. I ran CD-HIT to remove redundancy, but I still have many contigs. I am also concerned about the high duplication rate reported by BUSCO. What do you recommend I do?
Before CD-HIT:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  277062
Total trinity transcripts:      416235
Percent GC: 42.48
########################################
Stats based on ALL transcript contigs:
########################################
        Contig N10: 3438
        Contig N20: 2526
        Contig N30: 1984
        Contig N40: 1583
        Contig N50: 1231
        Median contig length: 451
        Average contig: 774.04
        Total assembled bases: 322183936
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794
        Median contig length: 370
        Average contig: 603.55
        Total assembled bases: 167220426
After CD-HIT (cd-hit-est -o cdhit -c 0.98 -i Trinity.fasta -p 1 -d 0 -b 3 -T 10):
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  276194
Total trinity transcripts:      396337
Percent GC: 42.40
########################################
Stats based on ALL transcript contigs:
########################################
        Contig N10: 3325
        Contig N20: 2428
        Contig N30: 1903
        Contig N40: 1504
        Contig N50: 1158
        Median contig length: 437
        Average contig: 744.38
        Total assembled bases: 295026540
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
        Contig N10: 3086
        Contig N20: 2125
        Contig N30: 1538
        Contig N40: 1097
        Contig N50: 794
        Median contig length: 371
        Average contig: 604.02
        Total assembled bases: 166826505
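To put the two reports side by side, here is a minimal Python sketch that computes how much CD-HIT removed. The numbers are copied from the stats above; the helper name `percent_removed` and the dictionary layout are only illustrative assumptions, not part of Trinity or CD-HIT.

```python
# Rough comparison of the Trinity stats before and after cd-hit-est.
# Values are copied from the two reports above; names are hypothetical.

def percent_removed(before: int, after: int) -> float:
    """Return the percentage of items removed going from `before` to `after`."""
    return 100.0 * (before - after) / before

stats = {
    "genes": (277062, 276194),
    "transcripts": (416235, 396337),
    "assembled bases (all transcripts)": (322183936, 295026540),
}

for name, (before, after) in stats.items():
    removed = before - after
    print(f"{name}: {removed} removed ({percent_removed(before, after):.2f}%)")
```

With these inputs the reduction is under 5% of transcripts and well under 1% of 'genes', which matches the impression that clustering at `-c 0.98` barely changed the assembly.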
Since you used Trinity, this must be RNA-seq data. In that case getting many contigs is not unexpected, nor is some "redundancy". Did you run BUSCO in transcript mode?
Hello Geno,
Yes, it is RNA-seq data, and I ran BUSCO in Galaxy in transcriptome mode:
A version of the genome already exists; however, the authors have not yet authorized its use for large-scale studies: