Our Trinity output assembly ('original') produced a lot of duplicated BLASTx matches, so we decided to try reducing redundancy with the tr2aacds script from the EvidentialGene project. tr2aacds filters and merges contigs according to their coding potential and percent identity, which sounds more reliable than the blast2cap3 approach or simple duplicate removal.
To compare the original and filtered assemblies, we ran some checks with BUSCO and BLASTx. The filtering does reduce the number of duplicated BUSCOs, but it also increases the number of missing and fragmented BUSCOs. The non-redundant (nr) assemblies still give some duplicated BLAST results, albeit far fewer. We're afraid of losing biologically meaningful data, but redundancy also causes problems in downstream analysis.
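For reference, here is a rough sketch of one way the duplicated BLASTx results could be counted for each assembly. It assumes tabular output (`-outfmt 6`) with the default 12 columns, and it treats a "duplicate" as a subject sequence that is the best hit (by bit score) for two or more contigs. The function name and the example rows are made up for illustration and are not our data.

```python
import csv
import io
from collections import defaultdict


def duplicated_subjects(blast_tab):
    """Return {subject_id: [query_ids]} for subjects that are the
    best hit (highest bit score) of two or more query contigs."""
    best = {}  # query id -> (bit score, subject id)
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, subject)

    by_subject = defaultdict(list)
    for query, (_, subject) in best.items():
        by_subject[subject].append(query)
    return {s: q for s, q in by_subject.items() if len(q) > 1}


# Made-up rows, for illustration only (default outfmt 6 column order)
example = io.StringIO(
    "contig1\tP12345\t98.0\t200\t4\t0\t1\t600\t1\t200\t1e-100\t400\n"
    "contig2\tQ99999\t40.0\t80\t48\t0\t1\t240\t1\t80\t1e-10\t100\n"
    "contig2\tP12345\t97.0\t200\t6\t0\t1\t600\t1\t200\t1e-98\t390\n"
    "contig3\tQ99999\t95.0\t150\t7\t0\t1\t450\t1\t150\t1e-80\t300\n"
)
print(duplicated_subjects(example))
```

Running the same function on the BLASTx tables of the original and the filtered assembly would give two counts that can be compared directly. Whether "best hit per contig" is the right definition of a duplicate is a judgment call.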
Does anybody use tr2aacds to reduce redundancy in de novo assemblies?