Dear all,
I'm working on an RNA-seq analysis of a highly heterozygous non-model diploid.
The transcriptome was assembled with CLC genomic software (k=64) after read trimming, and the contigs were searched with blastx against the UniProt database (Viridiplantae); only 30% of the hits were unique. In addition, I did another assembly with Trinity, merged it with the CLC assembly, and ran cd-hit-est on the merged set to remove redundancy (threshold 1). This produced 182968 clusters from 204397 input sequences. I then ran blastx on this assembly against only the Arabidopsis proteome (for a fast evaluation). Although 80% of the contigs got a hit, only 20% of the hits were unique. I also assessed the average collapse factor for this assembly, which was 12.66, which isn't too high. These results are driving me crazy, as I don't know whether they are usual or not. Which strategy is right? What is wrong, and how can I solve or even improve it? Please share your opinion on the issue.
Thank you very much for your participation.
I am not familiar with this, but if I were you, I would first extract the non-uniquely mapped contigs and see where they align. Then I would align the reference against itself, e.g. the Arabidopsis proteome against the Arabidopsis proteome. If you still get non-unique alignments, it might be because the complexity of the reference is low? At least this should allow you to rule out one possible cause of the problem.
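The first step suggested above could be sketched roughly as follows. This is only a minimal sketch, assuming the blastx results were saved in BLAST+ tabular format (`-outfmt 6`, default 12 columns with the bit score last). The function names are made up for illustration, and "unique" is interpreted here as "a subject protein that is the best hit of exactly one contig", which may not be the definition used in the question.

```python
import csv
from collections import defaultdict


def read_blast_tab(handle):
    """Iterate over rows of a BLAST -outfmt 6 file (tab-separated)."""
    return csv.reader(handle, delimiter="\t")


def best_hits(rows):
    """Return {contig: (subject, bitscore)}, keeping the top-scoring hit per contig."""
    best = {}
    for row in rows:
        # Skip blank lines, comment lines and rows without the 12 default columns.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        contig, subject, bitscore = row[0], row[1], float(row[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return best


def group_by_subject(best):
    """Return {subject: [contigs whose best hit is that subject]}."""
    groups = defaultdict(list)
    for contig, (subject, _score) in best.items():
        groups[subject].append(contig)
    return {subject: sorted(contigs) for subject, contigs in groups.items()}


def shared_subjects(groups):
    """Keep only subjects that are the best hit of more than one contig."""
    return {s: c for s, c in groups.items() if len(c) > 1}


def distinct_subject_fraction(groups):
    """Distinct subjects divided by contigs with a hit (assumed 'uniqueness' measure)."""
    n_contigs = sum(len(c) for c in groups.values())
    return len(groups) / n_contigs if n_contigs else 0.0
```

With a hypothetical results file `blastx.tsv`, one could open it, pass the handle through `read_blast_tab` and `best_hits`, and then inspect `shared_subjects(group_by_subject(best))` to see which proteins attract several contigs. If those contigs turn out to be alleles, isoforms or fragments of the same gene, the low uniqueness may reflect the assembly rather than the reference. The same parsing should also work on a proteome-against-itself blastp table, provided hits of a protein to itself are dropped first.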