I'm working on RNA-seq data from different tissues of a non-model rodent that I want to use for two purposes. One is to assemble a transcriptome and the other is DGE analysis.
When I ran FastQC on the samples, some of them had a high percentage of overrepresented sequences. I used BLAST against the nt database to identify them, and a lot of the matches have names like:
PREDICTED:_Suricata_suricatta_uncharacterized_LOC115301901_(LOC115301901),_ncRNA
Proteus_phage_VB_PmiS-Isfahan,_complete_genome
PREDICTED:_Macaca_mulatta_uncharacterized_LOC114679282_(LOC114679282),_ncRNA
Eukaryotic_synthetic_construct_chromosome_14
Should these types of sequences be removed from the RNA-seq data before using it to build a transcriptome or run the DGE analysis? If yes, what type of filtering would be best?
I agree with jared.andrews07 on the technical aspect here. But in general you do not remove ncRNA from an RNA-seq analysis. It is RNA and therefore present in the cell, so why remove it? You could remove small RNAs in standard RNA-seq, because they are typically too short to be properly captured and sequenced (they would require special kits), which makes their counts unreliable, but typically one does not bother. Trim your data and see if this solves the issue.
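To illustrate what "trim and re-check" means, here is a minimal Python sketch, not a replacement for a dedicated trimmer such as cutadapt or Trimmomatic. It assumes the overrepresented sequences come from the common Illumina TruSeq adapter start `AGATCGGAAGAGC`; that adapter choice, the function names, and the `min_overlap` value are all illustrative assumptions, so check which adapter your library prep actually used.

```python
# Assumption: reads carry the common Illumina TruSeq adapter start.
ADAPTER = "AGATCGGAAGAGC"


def adapter_fraction(reads, adapter=ADAPTER):
    """Return the fraction of reads that contain the full adapter sequence."""
    if not reads:
        return 0.0
    return sum(adapter in read for read in reads) / len(reads)


def trim_adapter(read, adapter=ADAPTER, min_overlap=5):
    """Cut the adapter (and everything after it) from the 3' end of a read.

    Exact matching only; real trimmers also tolerate mismatches.
    """
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Handle a partial adapter hanging off the end of the read.
    for k in range(len(adapter) - 1, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


# Toy reads (hypothetical, for illustration only).
reads = ["ACGTACGTAGATCGGAAGAGCACAC", "ACGTTTGGA", "GGGAGATCG"]
trimmed = [trim_adapter(r) for r in reads]

before = adapter_fraction(reads)
after = adapter_fraction(trimmed)
print(f"adapter-containing reads: {before:.2f} before, {after:.2f} after trimming")
```

If the fraction drops to zero and FastQC no longer flags the sequences after real trimming, the overrepresentation was adapter-derived. If it persists, the sequences are more likely genuine highly expressed RNA or contamination.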