What Script (Or Program) Should I Use To Filter Low Complexity And Repetitive Reads From RNA-Seq?
1
6
11.5 years ago

Hi! We are going to process some NGS data to find splice junctions. Low-complexity and repetitive sequences within reads often generate false-positive splice junctions. For this reason I need to remove all reads that contain low-complexity sequence.

Maybe the most common way to do this is to use RepeatMasker or DustMasker, but I think that would take a very long time (because I have a very large NGS dataset). Another option is to map the reads (with BFAST or another mapper) to Repbase and keep only the unmapped reads. I think bowtie isn't an option because I want a sensitive filter.
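
To illustrate the "keep only the unmapped reads" idea, here is a minimal sketch. It assumes the mapper writes standard SAM output, where bit 0x4 of the FLAG field marks an unmapped read. The function name `unmapped_read_names` is hypothetical and not part of any tool.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads flagged as unmapped in SAM-formatted lines.

    Assumption: input is plain-text SAM (header lines start with '@',
    alignment lines have at least 11 tab-separated fields).
    """
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # skip malformed or empty lines
        flag = int(fields[1])
        if flag & 0x4:  # 0x4 = read is unmapped (SAM specification)
            names.add(fields[0])
    return names
```

The returned names could then be used to pull the matching records out of the original FASTQ file, so that only reads with no hit against the repeat library go on to splice-junction detection.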

5
11.5 years ago

I have successfully used sga preprocess with the --dust option:

  --dust                           Perform dust-style filtering of low complexity reads. If you are performing
                                   de novo genome assembly, you probably do not want this.
  --dust-threshold=FLOAT           filter out reads that have a dust score higher than FLOAT (default: 4.0).
                                   This option implies --dust
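
For anyone wondering what a dust-style score measures, here is a rough Python sketch of the classic triplet-counting idea: count every overlapping 3-mer in the read and score how often 3-mers repeat. This is an illustration under my own assumptions and is not guaranteed to match sga's exact formula or scaling. The names `dust_score` and `filter_fastq` are hypothetical, and the 4.0 default is simply borrowed from the help text above.

```python
from collections import Counter


def dust_score(seq):
    """Simplified DUST-style score: higher means lower complexity."""
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        return 0.0  # too short to score meaningfully
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each 3-mer seen c times contributes c * (c - 1) / 2 repeated pairs.
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (n_triplets - 1)


def filter_fastq(lines, threshold=4.0):
    """Yield 4-line FASTQ records whose score does not exceed threshold.

    Assumption: strict 4-line FASTQ records (no wrapped sequences).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if dust_score(record[1]) <= threshold:
                yield tuple(record)
            record = []
```

With this sketch a homopolymer run such as 20 A's scores well above 4.0 and is dropped, while a read with mostly distinct 3-mers scores close to 0 and is kept. A pure-Python loop like this will be far slower than sga on large data, so treat it as an explanation of the idea and not a replacement.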


http://github.com/jts/sga/

1

Yes and yes. sga preprocess will produce a preprocessed FASTQ file. It will process a few Gb in a few minutes.

0

I have two questions: Can I use the dust-like filter option without using the assembly function? And is it efficient enough to run over large data in a reasonable time?