Question

Filtering of RNA fastq files prior to alignment

7.8 years ago

ognjen011 ▴ 290

I have a pair of RNA fastq files that are heavily contaminated with rRNA sequences, to the extent that almost half of the reads originate from a 10 kb stretch of the reference. I want to remove these reads, and I know I can do it at the BAM level in one of many ways (like samtools view), but I am curious whether there is an efficient way of removing those reads at the fastq stage. The idea is to speed up the alignment.

Is there a good way to remove unwanted reads prior to alignment, in order to speed up the processing? Maybe a trimming tool or something similar?

RNA-Seq alignment • 3.1k views

score 3 · Accepted Answer · 2017-09-14

7.8 years ago

Brian Bushnell 20k

The most efficient way would be alignment-free kmer-matching. To do that, you can remove all reads containing ribosomal kmers with BBDuk (where ref is the ribosomal sequence of this organism):

bbduk.sh in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq ref=ribo.fa k=31 mm=f
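As a rough illustration of what alignment-free kmer matching means (a conceptual sketch only, not BBDuk's actual implementation), the Python below indexes every kmer of a reference and discards a read pair if either mate shares at least one kmer with it. The function names are hypothetical, kmers are compared in canonical form so that both strands match, and dropping the whole pair when either mate hits is an assumption of this sketch.

```python
# Conceptual sketch of alignment-free kmer filtering (hypothetical helper names).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(seq, k=31):
    """Yield each kmer of seq in canonical form (the smaller of the kmer
    and its reverse complement), skipping kmers that contain N."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        yield min(kmer, reverse_complement(kmer))


def build_index(ref_seqs, k=31):
    """Collect all canonical kmers from the reference sequences into a set."""
    index = set()
    for ref in ref_seqs:
        index.update(canonical_kmers(ref, k))
    return index


def pair_is_contaminated(read1, read2, index, k=31):
    """True if either mate shares at least one kmer with the reference."""
    return any(
        kmer in index
        for read in (read1, read2)
        for kmer in canonical_kmers(read, k)
    )


def filter_pairs(pairs, index, k=31):
    """Keep only the read pairs in which neither mate matches the reference."""
    return [
        (r1, r2) for r1, r2 in pairs
        if not pair_is_contaminated(r1, r2, index, k)
    ]
```

Each read costs only a handful of set lookups and no alignment is computed, which is why this kind of filter is fast. Exact 31-mer matching also means a read needs a 31-base error-free stretch identical to the reference to be caught.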

You can alternatively gather a set of kmers that are present in ribosomal sequence but not in the rest of the genome and use that, but it's more complicated and probably not necessary.
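For readers curious what that alternative could look like, here is a minimal Python sketch under stated assumptions: `ribo_specific_kmers` is a hypothetical helper, and `genome_rest_seqs` is assumed to be the genome with the ribosomal loci already excluded (otherwise every ribosomal kmer would be subtracted). It is only a set difference on canonical kmers; a whole genome would need a far more memory-efficient kmer store than a Python set.

```python
# Sketch: kmers present in the ribosomal sequence but absent from the
# rest of the genome (hypothetical helper names, in-memory sets only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmer_set(seq, k):
    """Return the set of canonical kmers in seq, skipping kmers with N."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        rc = kmer.translate(COMPLEMENT)[::-1]
        kmers.add(min(kmer, rc))
    return kmers


def ribo_specific_kmers(ribo_seqs, genome_rest_seqs, k=31):
    """Return ribosomal kmers that never occur in the rest of the genome."""
    ribo = set()
    for seq in ribo_seqs:
        ribo |= canonical_kmer_set(seq, k)
    background = set()
    for seq in genome_rest_seqs:
        background |= canonical_kmer_set(seq, k)
    return ribo - background
```

Filtering with only these kmers would avoid discarding non-ribosomal reads that happen to share a kmer with the ribosomal sequence, at the cost of the extra preprocessing step.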

Thank you! This is exactly what I need, but is there a free-to-use alternative? BBDuk seems to be strictly commercial software.

Reply • 7.8 years ago by ognjen011 ▴ 290

Sorry, I seem to be looking at the wrong package. Thanks for the answer!

Reply • 7.8 years ago by ognjen011 ▴ 290

To make this clear to future readers, BBDuk is open-source, available here, and free for all uses.

Reply • 7.8 years ago by Brian Bushnell 20k