Good tool/best practice for filtering out homopolymeric reads?
10 months ago
Sean ▴ 10

Hi there,

I have some reads that were generated from vanishingly low RNA input. I did first/second strand synthesis and an Illumina DNA Prep library preparation. I sent the samples for sequencing and (I assume) because there was such low input, there are a huge number of reads that are just 150 bp of A or 150 bp of T, etc. The end goal is to use kraken/metabuli for metagenomic classification of the reads, likely after filtering out human reads, so I'd rather save some compute and not waste time aligning total garbage when in some samples it accounts for 40% of the reads.

I'm wondering if there is an existing tool or best practice to filter out these reads that are very obviously trash. The read quality is very high, so I can't use quality filtering... I suppose I can write a quick script that just trashes anything with an unreasonably high fraction of nucleotides of one type, but hoping that this is something others have come across and dealt with in a more systematic way.
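For what it's worth, the quick-script approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a polished tool: the filenames, the 0.9 cutoff, and the function names are all arbitrary choices for the example.

```python
# Minimal sketch: drop FASTQ reads in which any single nucleotide makes
# up more than max_frac of the sequence. The 0.9 cutoff and filenames
# are illustrative, not recommendations.
from collections import Counter

def is_homopolymeric(seq, max_frac=0.9):
    """True if one nucleotide accounts for > max_frac of the read."""
    if not seq:
        return True
    counts = Counter(seq.upper())
    return max(counts.values()) / len(seq) > max_frac

def filter_fastq(in_path, out_path, max_frac=0.9):
    """Stream 4-line FASTQ records, writing only non-homopolymeric reads."""
    kept = dropped = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if is_homopolymeric(record[1].strip(), max_frac):
                dropped += 1
            else:
                kept += 1
                fout.writelines(record)
    return kept, dropped
```

Streaming the file four lines at a time keeps memory flat even on large FASTQs, though a dedicated tool (see the accepted answer) will be faster and handles gzipped input for you.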

Thanks!

Sean

quality-control metagenomics data-cleaning
10 months ago
GenoMax 142k

Use bbduk.sh in filter mode to remove homopolymer reads entirely, or in trim mode to trim the homopolymer stretches off. A guide is available here.
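For anyone finding this later, one way to do the filter-mode version is BBDuk's entropy filter, since pure homopolymers have near-zero sequence entropy. The filenames and the entropy cutoff below are illustrative placeholders; check the guide for the options that fit your data.

```shell
# Discard low-complexity reads (homopolymers score near entropy 0) and
# divert them to a separate file for inspection. Filenames are
# placeholders; entropy=0.3 is an illustrative threshold.
bbduk.sh in=reads.fq.gz out=clean.fq.gz outm=lowcomplexity.fq.gz \
    entropy=0.3 entropywindow=50 entropyk=5
```

Keeping the rejected reads in `outm=` lets you spot-check that the threshold isn't throwing away real sequence before you commit to it across all samples.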


Thanks so much, this is exactly what I needed!


