I have a lot of paired-end RNA-seq data of very good quality, but some of the files contain many overrepresented sequences that are not adapters. I ran BLAST on these sequences. Some of them didn't match anything, and others seem to be rRNA. I understand that opinions are divided: some people say it is better to remove overrepresented sequences, and others say there is no need to. This time I decided to remove them with cutadapt, because the overrepresented sequences vary from one file to another. But after removing them, the FastQC basic statistics of these files changed (sequence length is now 1-150) and NEW overrepresented sequences appeared (I wasn't expecting to get more than the initial ones). I'm thinking that maybe I made a mistake with cutadapt and want to try Trimmomatic instead, but I can't find an option in the manual that lets me specify the sequence I want to remove from a specific file (my impression is that Trimmomatic can only remove adapters that the software recognizes). Can anyone give me advice on how to proceed with the (de novo) assembly?
RNA-seq data will always contain overrepresented sequences, because highly expressed genes contribute a disproportionate share of the reads. If you remove these sequences, you will be removing genes, and your assembly will be less complete and/or more fragmented. Apart from adapters, you should not remove any sequences before assembly. You may (and this is the Trinity default, for example) perform digital normalization prior to assembly to reduce memory usage and run time.
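
To make the idea of digital normalization concrete, here is a minimal Python sketch of the median k-mer abundance approach: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a cutoff. This is a toy illustration under my own assumptions, not Trinity's implementation (its built-in normalization differs in the details). The function names, the toy sequences, and the values `k=5` and `cutoff=3` are made up for the example; real tools use much larger k and cutoff values.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=5, cutoff=3):
    """Toy digital normalization (hypothetical helper, not a library API).

    A read is kept only if the median count of its k-mers, counted over
    the reads kept so far, is still below `cutoff`.
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy data: a highly expressed transcript (20 copies) and a rare one (2 copies).
high = "ACGTTGCAGGCTAACGT"
rare = "TTGACCGGATATCCGGA"
reads = [high] * 20 + [rare] * 2

kept = digital_normalize(reads, k=5, cutoff=3)
print(kept.count(high), kept.count(rare))  # 3 2
```

The point is that the highly expressed transcript is down-sampled rather than deleted: enough copies remain for the assembler to reconstruct it, while the rare transcript is left untouched. That is the difference from removing overrepresented sequences outright.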