What Script (Or Program) Should I Use To Filter Low-Complexity And Repetitive Reads From RNA-Seq?
13.4 years ago
Geparada ★ 1.5k

Hi! We are going to process some NGS data in order to find splice junctions. Low-complexity and repetitive sequences in reads often generate false-positive splice junctions. For this reason, I need to remove all reads that contain low-complexity sequence.

Maybe the most common way to do this is to use RepeatMasker or Dustmasker, but I think that would take a very long time (because I have a very large NGS dataset). Another option is to map the reads (with BFAST or another mapper) to Repbase and keep only the unmapped reads. I think Bowtie isn't an option because I want a sensitive filter.
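To make the "map to Repbase, keep the unmapped reads" idea concrete, here is a minimal sketch. It assumes the mapper writes SAM output, in which case unmapped reads carry the 0x4 bit in the FLAG column. The function name `unmapped_read_names` and the example records are made up for illustration.

```python
from typing import Iterable, Iterator

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_read_names(sam_lines: Iterable[str]) -> Iterator[str]:
    """Yield the names of reads that did not align to the repeat library."""
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED_FLAG:
            yield name


# Hypothetical alignments against a repeat library: read1 hits a repeat,
# read2 is unmapped and would therefore be kept for junction finding.
example = [
    "@SQ\tSN:AluY\tLN:311",
    "read1\t0\tAluY\t1\t30\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGCATTCA\tIIIIIIII",
]
print(list(unmapped_read_names(example)))
```

The resulting names could then be used to pull the matching records out of the original FASTQ file. How sensitive the filter is still depends entirely on the mapper and its settings.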

Thanks for your time. I'll wait for your suggestions.

next-gen sequencing rna read repeatmasker • 5.8k views
13.4 years ago

I have successfully used sga preprocess with the --dust option:

  --dust                           Perform dust-style filtering of low complexity reads. If you are performing
                                   de novo genome assembly, you probably do not want this.
  --dust-threshold=FLOAT           filter out reads that have a dust score higher than FLOAT (default: 4.0).
                                   This option implies --dust

http://github.com/jts/sga/
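For anyone curious what a dust-style filter does, here is a rough Python sketch. It uses the classic triplet-counting DUST score, which may not be exactly the formula sga implements, so thresholds may not transfer one-to-one. The names `dust_score` and `filter_fastq` are hypothetical, and the 4.0 default only mirrors the default shown in the help text above.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def dust_score(seq: str) -> float:
    """Triplet-based DUST-style score; higher means lower complexity."""
    seq = seq.upper()
    n_triplets = len(seq) - 2
    if n_triplets < 2:
        return 0.0  # too short to score meaningfully
    counts = Counter(seq[i:i + 3] for i in range(n_triplets))
    # Each triplet seen c times contributes c * (c - 1) / 2 repeated pairs.
    repeats = sum(c * (c - 1) / 2 for c in counts.values())
    return repeats / (n_triplets - 1)


def filter_fastq(
    lines: Iterable[str], threshold: float = 4.0
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield 4-line FASTQ records whose score is at or below the threshold."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself groups the lines four at a time.
    for record in zip(stripped, stripped, stripped, stripped):
        if dust_score(record[1]) <= threshold:
            yield record


# Made-up reads: a homopolymer run and a read with no repeated triplets.
reads = ["@low", "A" * 30, "+", "I" * 30, "@ok", "ACGTTGCA", "+", "IIIIIIII"]
kept = [record[0] for record in filter_fastq(reads)]
print(kept)
```

This is only meant to show the idea. For real data, a compiled tool will be much faster than a pure-Python loop.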


Yes and yes. sga preprocess will produce a preprocessed FASTQ file. It will process a few Gb in a few minutes.


I have two questions: Can I use the dust-like filter option without using the assembly function? And is it efficient enough to run over large data in a reasonable time?

