Hi there,
I have some reads that were generated from vanishingly low RNA input. I did first/second-strand synthesis and an Illumina DNA Prep library preparation. I sent the samples for sequencing and, I assume because there was such low input, there is a huge number of reads that are just 150 bp of A, or 150 bp of T, etc. The end goal is to use kraken/metabuli for metagenomic classification of the reads, likely after filtering out human reads, so I'd rather save some compute and not waste time classifying total garbage when in some samples it accounts for 40% of the reads.
I'm wondering if there is an existing tool or best practice to filter out these reads that are very obviously trash. The read quality is very high, so quality filtering won't catch them. I suppose I could write a quick script that just trashes anything with an unreasonably high fraction of a single nucleotide type, but I'm hoping this is something others have come across and dealt with in a more systematic way.
Thanks!
Sean