I am trying to prepare two files containing several million Illumina paired-end RNA reads for a de novo assembly with Trinity and, as I posted the other day, I have some doubts about how to prepare the datasets in order to obtain the best transcriptome assembly.
In this case my doubt is about how the overrepresentation of some sequences would affect the assembly. My datasets have deep coverage and, as a result, I have a strong overrepresentation of some (non-artifact) sequences (some of them representing up to 0.2% of the total number of sequences) and a very high level of sequence duplication (approx. 73%). Are these parameters important for the quality of the assembly? If they are, how can I address them? Should I normalize the datasets before performing the assembly?
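To make the two figures above concrete, here is a minimal Python sketch of how a duplication level and the share of the most common sequence could be computed by exact sequence matching. It is only an illustration, not how the numbers above were obtained: the function name `duplication_stats` and the tiny in-memory FASTQ are hypothetical, and QC tools may define duplication differently.

```python
from collections import Counter
import io


def duplication_stats(fastq_handle):
    """Return (duplicated_fraction, top_sequence_fraction) for a FASTQ stream.

    Assumes plain 4-line FASTQ records and counts exact sequence matches only.
    """
    counts = Counter()
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the 2nd line of each 4-line record
            counts[line.strip()] += 1

    total = sum(counts.values())
    if total == 0:
        return 0.0, 0.0

    # Reads beyond the first copy of each distinct sequence count as duplicates.
    duplicated = total - len(counts)
    top = counts.most_common(1)[0][1]
    return duplicated / total, top / total


# Hypothetical toy input: 4 reads, 3 of them identical.
example = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGT\n+\nIIII\n"
    "@r4\nTTGA\n+\nIIII\n"
)
dup, top = duplication_stats(example)
print(f"duplicated: {dup:.0%}, most common sequence: {top:.0%}")
# prints "duplicated: 50%, most common sequence: 75%"
```

For real files with millions of reads, holding every distinct sequence in memory may not be practical, so this is best read as a definition of the two quantities rather than a tool to run on the full datasets.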
I would be very grateful if someone could help me with this (at least for me) puzzling issue.