I'm working on transcriptomes from non-model organisms, built from Illumina reads, and I'm facing a problem I haven't encountered before. In short, my data come from 3 samples, each a pool of 5 whole specimens, sequenced on an Illumina instrument. I checked read quality with FastQC and trimmed with Trimmomatic accordingly. After that, I concatenated the resulting files to build a single assembly from the ~100M reads. Then I ran a standard Trinity assembly (without in silico normalization). Here is where it gets strange:
The assembly produced 526860 transcripts (isoform level) with an N50 of 858 and a median contig length of 377. In addition (and this is what really worries me), I ran BUSCO to assess completeness and got the following result: C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978.
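To make clear how I am reading these numbers, here is a minimal Python sketch of the two things involved: how N50 and the median relate to a list of contig lengths, and how the BUSCO short-summary string breaks down. The function names `n50` and `parse_busco_summary` are made up for illustration and are not part of Trinity or BUSCO, and the toy length list is not my real data.

```python
import re
from statistics import median


def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


def parse_busco_summary(summary):
    """Split a BUSCO short-summary string into its fields.

    C = complete, S = complete single-copy, D = complete duplicated,
    F = fragmented, M = missing (all percentages), n = number of BUSCOs.
    """
    fields = dict(re.findall(r"([CSDFMn]):([\d.]+)", summary))
    result = {key: float(value) for key, value in fields.items() if key != "n"}
    result["n"] = int(fields["n"])
    return result


# Toy contig lengths (hypothetical, only to show the two statistics).
toy_lengths = [200, 300, 400, 500, 600]
print("N50:", n50(toy_lengths), "median:", median(toy_lengths))

# The summary string reported by BUSCO for my assembly.
busco = parse_busco_summary("C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978")
print(busco)
```

In the summary above, S and D add up to C (18.0 + 80.6 = 98.6), so the 80.6% really is the fraction of BUSCOs found complete in more than one copy.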
This duplication level is ridiculously high, and I don't know what is causing it. I've checked the BUSCO documentation and searched both Biostars and SEQanswers, but I haven't found duplication levels like this reported for a transcriptome. Have you had a similar experience? Do you have any suggestions for bringing these numbers down?
I'm stuck with this and would really appreciate any help.