Removing reads shorter than 10kb from FASTQ files
6.8 years ago
Ric ▴ 430

Hi, how can I remove reads shorter than 10 kb from FASTQ files? I found out how to do it for FASTA files, but not for FASTQ files:

bioawk -c fastx '{ if(length($seq) > 10000) { print ">"$name; print $seq }}' bwa/unmapped_${output}.fasta > bwa/unmapped_${output}-gt-10000.fasta

Thank you in advance.

fastq • 5.3k views

Wow.

I am trying to think of an experiment/analysis that would be strictly benefited by removing reads under 10kbp, and I'm drawing blanks. I don't work with Nanopore often, but I do often work with PacBio, and... well... no, I can't imagine a scenario. There are scenarios in which software for long, low-accuracy reads gives better results when people throw away short reads. But that is always a flaw in the software (in which case, you should complain and demand better software, rather than throwing away data); and by "short", I mean >500bp or so. I think, if you throw away 10kbp reads for any experiment because they are too short, you're doing it wrong.


Hi, I removed contaminants such as chloroplast sequences from my PacBio reads, and some of the reads are very short afterwards. Do you think in this case it is a good idea to remove reads which are shorter than 500kb or 1000kb?


That depends on what you are doing, but generally no (I assume you mean bp, not kbp).

6.2 years ago

Via Essential AWK Commands for Next Generation Sequence Analysis:

$ awk 'NR%4==1{a=$0} NR%4==2{b=$0} NR%4==3{c=$0} NR%4==0&&length(b)>10000{print a"\n"b"\n"c"\n"$0;}' file.fq > result.fq

If your FASTQ is compressed:

$ gunzip -c file.fqz | awk 'NR%4==1{a=$0} NR%4==2{b=$0} NR%4==3{c=$0} NR%4==0&&length(b)>10000{print a"\n"b"\n"c"\n"$0;}' - | gzip -c - > result.fqz
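For anyone who wants to try the same idea on a toy file first, here is a minimal sketch with the length cutoff passed in as a variable. The function name `filter_fastq` and the `demo.fq` files are made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence lines). Note it keeps reads with length >= the cutoff, whereas the one-liners above use a strict `>` and so drop reads of exactly 10000 bases.

```shell
# Build a tiny demo FASTQ with reads of length 4, 8 and 12 (made-up data).
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n@r3\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' > demo.fq

# Keep records whose sequence is at least `min` bases long.
# Usage: filter_fastq MIN_LENGTH INPUT.fq
# Assumes each record is exactly 4 lines: header, sequence, '+', quality.
filter_fastq() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }   # @name line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 && length(seq) >= min {
            print header; print seq; print plus; print $0   # $0 is the quality line
        }
    ' "$2"
}

# Demo with a cutoff of 8 bases; for real data use 10000.
filter_fastq 8 demo.fq > demo.filtered.fq
```

For real data the call would be something like `filter_fastq 10000 file.fq > result.fq`. It is worth checking the read count before and after (number of lines divided by 4) so you know how much data the cutoff discards.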
6.8 years ago

You can do this using my NanoFilt tool. It's written for Oxford Nanopore sequencing data, but there is no reason that it wouldn't work for anything else. There is more information on filtering on my blog, Gigabase or gigabyte.
