Hi! We are going to process some NGS data in order to find splice junctions. Low-complexity and repetitive sequences in reads often generate false-positive splice junctions. For this reason, I need to remove all reads that contain low-complexity sequence.
Maybe the most common way to do this is with RepeatMasker or Dustmasker, but I think that would take a very long time (because I have a very large NGS dataset). Another option is to map the reads (with BFAST or another mapper) to Repbase and keep only the unmapped reads. I think bowtie isn't an option because I want a sensitive filter.
Thanks for your time, I'll wait for your suggestions.
Yes and yes. sga preprocess will produce a preprocessed FASTQ file. It will get through a few Gb in a few minutes.
I have two questions: can I use the dust-like filter option without running the assembly step, and is it efficient enough to process large datasets in a reasonable time?
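In case it helps to see what a dust-like filter computes, here is a minimal Python sketch of a DUST-style score based on triplet counts. This is not sga's code: the function names (dust_score, is_low_complexity, filter_reads) and the 4.0 cutoff are assumptions chosen for illustration, and the right threshold would need tuning on your own reads.

```python
from collections import Counter


def dust_score(seq, k=3):
    """DUST-style complexity score: higher means lower complexity.

    Counts every overlapping k-mer (triplets by default), sums
    c * (c - 1) / 2 over the counts, and normalises by the number of
    k-mers minus one. This is an illustrative formula, not sga's code.
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers < 2:
        return 0.0  # too short to score meaningfully
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    repeated_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return repeated_pairs / (n_kmers - 1)


def is_low_complexity(seq, threshold=4.0):
    # threshold=4.0 is an assumed cutoff for illustration only
    return dust_score(seq) > threshold


def filter_reads(reads, threshold=4.0):
    """Yield only the (name, seq, qual) tuples that pass the filter."""
    for name, seq, qual in reads:
        if not is_low_complexity(seq, threshold):
            yield name, seq, qual


if __name__ == "__main__":
    reads = [
        ("read1", "ACGTTGCAAGCTAGGTCCATGA", "I" * 22),
        ("read2", "A" * 50, "I" * 50),
    ]
    for name, seq, _ in filter_reads(reads):
        print(name, round(dust_score(seq), 2))
```

A homopolymer read scores very high and is dropped, while a read whose triplets are all distinct scores zero and is kept. Pure Python like this will be far slower than a compiled tool on a large dataset, so treat it as a way to understand the filter or to test thresholds on a small subset.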