Question

What Script (Or Program) Should I Use To Filter Low Complexity And Repetitive Reads From RNA-Seq?

6

13.4 years ago

Geparada ★ 1.5k

Hi! We are going to process some NGS data in order to find splice junctions. Low-complexity and repetitive sequences in reads often generate false-positive splice junctions. For this reason I need to remove all reads that contain low-complexity sequence.

The most common way to do this is probably RepeatMasker or DustMasker, but I think it would take a very long time (because I have a very large NGS dataset). Another option is to map the reads (with BFAST or another mapper) to Repbase and keep only the unmapped reads. I think Bowtie isn't an option because I want a sensitive filter.

Thanks for your time. I look forward to your suggestions.

next-gen sequencing rna read repeatmasker • 5.8k views

updated 13.4 years ago by 2184687-1231-83- ★ 5.1k • written 13.4 years ago by Geparada ★ 1.5k

score 5 · Answer 1 · 2011-06-07

5

13.4 years ago

2184687-1231-83- ★ 5.1k

I have successfully used sga preprocess with the dust option:

  --dust                           Perform dust-style filtering of low complexity reads. If you are performing
                                   de novo genome assembly, you probably do not want this.
  --dust-threshold=FLOAT           filter out reads that have a dust score higher than FLOAT (default: 4.0).
                                   This option implies --dust

http://github.com/jts/sga/
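
To give a feel for what a dust-style filter does, here is a minimal Python sketch. It assumes the common triplet-based formulation of the DUST score (count every overlapping 3-mer, then sum c*(c-1)/2 over the counts and normalise by the number of triplets minus one). This is an illustration only and is not taken from the sga source, so scores may not match sga exactly. The names dust_score and filter_fastq are hypothetical, and the 4.0 cutoff simply mirrors the default quoted in the help text above.

```python
from collections import Counter


def dust_score(seq):
    """Triplet-based DUST-style score (assumed formulation, not sga's code)."""
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        # Too short to contain a repeated triplet.
        return 0.0
    # Count every overlapping 3-mer in the read.
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each 3-mer seen c times contributes c*(c-1)/2 repeated pairs.
    pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return pairs / (n_triplets - 1)


def filter_fastq(in_path, out_path, threshold=4.0):
    """Copy 4-line FASTQ records whose score is at or below the threshold.

    Assumes an uncompressed FASTQ file with exactly four lines per record.
    Returns (kept, dropped) record counts.
    """
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            if dust_score(record[1].strip()) > threshold:
                dropped += 1
            else:
                fout.writelines(record)
                kept += 1
    return kept, dropped
```

Under this formulation a homopolymer such as 20 A's scores well above 4.0 and would be dropped, while a read with no repeated triplets scores 0. A pure-Python loop like this will be much slower than a compiled tool on large data, so treat it as a way to understand the score rather than a replacement.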

1

Yes and yes. sga preprocess will produce a preprocessed FASTQ file, and it will process a few Gb in a few minutes.

13.4 years ago by 2184687-1231-83- ★ 5.1k

0

I have two questions: Can I use the dust-like filter option without using the assembly function? And is it efficient enough to run over large data in a reasonable time?

13.4 years ago by Geparada ★ 1.5k