Filtering of RNA fastq files prior to alignment
3.6 years ago
ognjen011 ▴ 200

I have a pair of RNA FASTQ files that are heavily contaminated with rRNA sequences, to the extent that almost half of the reads originate from a 10 kb long sequence of the reference. I want to remove these reads, and I know I can do it at the BAM level in one of many ways (like samtools view), but I am curious whether there is an efficient way of removing those reads at the FASTQ stage. The idea is to speed up the alignment.

Is there a good way to remove unwanted reads prior to alignment, in order to speed up the processing? Maybe a trimming tool or something similar?

RNA-Seq alignment • 1.5k views
3.6 years ago

The most efficient way would be alignment-free kmer-matching. To do that, you can remove all reads containing ribosomal kmers with BBDuk (where ref is the ribosomal sequence of this organism):

bbduk.sh in1=r1.fq in2=r2.fq out1=clean1.fq out2=clean2.fq ref=ribo.fa k=31 mm=f
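
To show what alignment-free k-mer matching means in practice, here is a minimal Python sketch of the idea. It is not BBDuk's implementation, just an illustration under simplifying assumptions: exact k-mer matches only, sequences held in memory as plain strings, and a read pair discarded as soon as either mate shares one k-mer with the reference. All function names are hypothetical.

```python
# Illustrative sketch only: exact k-mer matching on in-memory strings.
# Function names are made up for this example and are not part of any tool.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers (length-k substrings) of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_kmers(ref_seq, k):
    """Index both strands so reads in either orientation are caught."""
    return kmers(ref_seq, k) | kmers(reverse_complement(ref_seq), k)


def read_hits_reference(read, ref_kmers, k):
    """True if the read shares at least one exact k-mer with the reference."""
    read = read.upper()
    return any(read[i:i + k] in ref_kmers for i in range(len(read) - k + 1))


def filter_pairs(pairs, ref_kmers, k):
    """Keep only pairs in which neither mate contains a reference k-mer."""
    return [
        (r1, r2)
        for r1, r2 in pairs
        if not (read_hits_reference(r1, ref_kmers, k)
                or read_hits_reference(r2, ref_kmers, k))
    ]


# Toy usage with a short made-up "ribosomal" sequence and k=8.
ribo = "ACGTTGCAAGGCTTAACCGGTTAGC"
ref_kmers = build_reference_kmers(ribo, 8)
pairs = [
    ("TGCAAGGCTTAA", "GGGGGGGGGGGG"),  # mate 1 is taken from ribo
    ("AAAAAAAAAAAA", "CCCCCCCCCCCC"),  # neither mate matches
]
clean = filter_pairs(pairs, ref_kmers, 8)
```

Each read costs only a handful of set lookups, which is why this kind of filter is so much cheaper than a full alignment.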


You can alternatively gather a set of kmers that are present in ribosomal sequence but not in the rest of the genome and use that, but it's more complicated and probably not necessary.
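
For anyone curious what that second approach would look like, the core of it is a set difference: take the k-mers of the ribosomal sequence and subtract every k-mer seen elsewhere in the genome. The sketch below is only a toy illustration with hypothetical function names and short in-memory strings. A real genome would need a dedicated k-mer counting tool rather than Python sets.

```python
# Illustrative sketch only: ribosome-specific k-mers via set difference.
# Function names are hypothetical; inputs are short in-memory strings.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def both_strand_kmers(seq, k):
    """Return all k-mers of seq and of its reverse complement."""
    result = set()
    for strand in (seq.upper(), reverse_complement(seq).upper()):
        result.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return result


def ribo_specific_kmers(ribo_seq, other_seqs, k):
    """K-mers found in ribo_seq but in none of the other sequences."""
    background = set()
    for seq in other_seqs:
        background |= both_strand_kmers(seq, k)
    return both_strand_kmers(ribo_seq, k) - background


# Toy usage: "ACGTAC" is shared with the background, "TTGACC" is not.
specific = ribo_specific_kmers("ACGTACGTTTGACC", ["ACGTACGTAAAA"], 6)
```

The resulting set could then serve as the k-mer reference for filtering, so that reads from non-ribosomal regions that happen to share sequence with the rRNA are not discarded.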


Thank you! Although this is exactly what I need, is there a free-to-use alternative? BBDuk seems to be strictly commercial software.


Sorry, I seem to be looking at the wrong package. Thanks for the answer!


To make this clear to future readers, BBDuk is open-source, available here, and free for all uses.