I de novo assembled several transcriptomes of the same organism and found that as the number of reads (samples) increases, the resulting assembly gets larger and larger. To my knowledge, this means the assembly contains redundant transcripts, right? What I want to ask is why this happens and how to remove these redundant transcripts. One more concern: when removing redundancy, is it possible that we lose some genes of the same family, or that downstream quantification steps get disturbed within that family? As far as I know, the following steps may help: during assembly, use --normalize_reads to limit the maximum read coverage; after the Trinity assembly, use TGICL to extend the transcripts and CD-HIT to remove highly similar sequences. Are there other effective tools or strategies that can help with this?
Yes, you will most probably get redundant FASTA transcripts, even if you used the read-normalization option in Trinity.
Some of these contigs correspond to the same gene: they are not necessarily different isoforms, but different assembly alternatives of the same transcript.
Now, depending on the size of your assembled transcriptome, you have several choices.
One is using CD-HIT (cd-hit-est for nucleotide sequences), but you may be limited by the size of your transcriptome.
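As a sketch, a typical cd-hit-est run to collapse near-identical transcripts might look like the following; the 95% identity threshold, memory limit, and file names are example values you should tune for your own data:

```shell
# Cluster assembled transcripts at 95% nucleotide identity,
# keeping one representative sequence per cluster.
#   -i  input FASTA (e.g. Trinity output)
#   -o  output FASTA of cluster representatives
#   -c  sequence identity threshold (0.95 = 95%)
#   -n  word size (the CD-HIT guide recommends 10-11 for thresholds >= 0.95)
#   -M  memory limit in MB, -T number of threads
cd-hit-est -i Trinity.fasta -o Trinity_nr95.fasta -c 0.95 -n 10 -M 16000 -T 8
```

Note that lowering -c too far risks collapsing close paralogs within a gene family, which is exactly the concern raised above, so it is worth inspecting the accompanying .clstr file before quantification.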
I am sure there are other alternatives as well. Here is a nice paper dealing with this subject.