I have obtained paired-end RNA-seq data from an Illumina HiSeq run.
An initial FastQC check of the raw data showed good quality in terms of Q scores and length distribution. However, the FastQC Sequence Duplication Levels module showed a high level of duplication (10-20% of the sequences fall in the >10 to >5k duplication bins), and only around 15% of the sequences would remain after deduplication. Is this normal for RNA-seq data? Could you please give me some advice on this problem?
In addition, should I use the Prinseq "-derep" parameter to filter out duplicated reads from the raw data (for example, -derep 24)? Would this filtering step affect the downstream differential expression analysis?
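For reference, this is roughly how I understand the "percent remaining after deduplication" figure. It is only a minimal sketch: it counts exact-match sequences over every read, which is a simplification and not FastQC's actual procedure. The function names and the tiny in-memory FASTQ are made up for illustration.

```python
from collections import Counter
import io


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the bases
            yield line.strip()


def percent_remaining_after_dedup(sequences):
    """Percentage of reads left if each distinct sequence is kept once.

    Simplified: exact string matches only, over all reads supplied.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 100.0 * len(counts) / total


# Hypothetical toy data: four reads, two distinct sequences.
example_fastq = (
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGT\n+\nIIII\n"
    "@r4\nTTGA\n+\nIIII\n"
)

pct = percent_remaining_after_dedup(read_sequences(io.StringIO(example_fastq)))
print(f"{pct:.1f}% of reads remain after deduplication")
```

On real data the handle would be an opened FASTQ file for one mate of the pair. A highly expressed transcript naturally produces many identical reads, so a low percentage here does not by itself distinguish biological duplication from PCR duplication.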
Thank you very much!