Question: Filtering Fastq Sequences Based On Lengths
6
gravatar for empyrean999
6.6 years ago by
empyrean999160
Canada
empyrean999160 wrote:

As the question says,

I have a fastq file from small RNA sequencing with sequence lengths between 15 - 30. I wanted to filter sequence lengths between 21-25 and write to another file. how can i do that?

perl unix awk • 17k views
ADD COMMENTlink modified 1 day ago by komjinhubuio10 • written 6.6 years ago by empyrean999160
14
gravatar for Wen.Huang
6.6 years ago by
Wen.Huang1.2k
Wen.Huang1.2k wrote:
cat your.fastq | paste - - - - | awk 'length($2)  >= 21 && length($2) <= 25' | sed 's/\t/\n/g' > filtered.fastq
ADD COMMENTlink written 6.6 years ago by Wen.Huang1.2k
3

If I may, a pure Awk command is twice faster:

awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 21 && length(seq) <= 25) {print header, seq, qheader, qseq}}' < your.fastq > filtered.fastq

Like your solution with paste, it assumes that a fastq record takes exactly 4 lines.

Edit: deal with spaces in sequence names, as suggested by brianpenghe.

ADD REPLYlink modified 10 months ago • written 6.0 years ago by Frédéric Mahé2.9k

How would this be modified for gzip's fastqs?

ADD REPLYlink written 3.3 years ago by shanasabri40

zcat decompresses the data of all the input files, and writes the result on the standard output. zcat concatenates the data in the same way cat do

zcat your.fastq.gz | ...
ADD REPLYlink modified 2.4 years ago • written 3.1 years ago by Medhat8.5k

....................removed....................

ADD REPLYlink modified 10 months ago • written 2.4 years ago by shanasabri40

Like, a normal for loop?

for fastq in *.fastq
do
awk ... < $fastq > filtered_$fastq
done
ADD REPLYlink written 2.4 years ago by WouterDeCoster41k

One note: This command doesn't work when the read names contain spaces. better use awk -F"\t" instead of awk

ADD REPLYlink written 10 months ago by brianpenghe20

This is just printing zero lines! Note I changed constraints to either >=16 only, or >=16 && <= 500.

ADD REPLYlink modified 9 months ago • written 9 months ago by caverill40
3
gravatar for Martin A Hansen
6.6 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Using Biopieces www.biopieces.org)

read_fastq -i in.fq | grab -e 'SEQ_LEN>=21' | grab -e 'SEQ_LEN<=25' | write_fastq -o out.fq -x

And when you realize that you want to do a lot of extra things besides filtering on sequence length you will find lots of useful tools in Biopieces.

ADD COMMENTlink written 6.6 years ago by Martin A Hansen3.0k

Is grab the new grep? ;-)

ADD REPLYlink written 6.6 years ago by Christof Winter990

Not quite: https://code.google.com/p/biopieces/wiki/grab

ADD REPLYlink written 6.6 years ago by Martin A Hansen3.0k
2
gravatar for Philipp Bayer
6.6 years ago by
Philipp Bayer6.5k
Australia/Perth/UWA
Philipp Bayer6.5k wrote:

Edit: If you're coming via Google, my answer is very, very old. Consider komjinhubuio's answer, seqkit is the 'modern' way to go

Using the awesome readfq-library in perl, and their modified example:

  my @aux = undef; # this is for keeping intermediate data
  while (my ($name, $seq, $qual) = readfq(\*STDIN, \@aux)) { 
     if( (length($seq) >= 21) && (length($seq) <= 25) ) { 
         print "@$name\n";
         print "$seq\n"; 
         print "+\n";
         print "$qual\n";
     }
  }

(Beware: Haven't tested this yet)

ADD COMMENTlink modified 12 hours ago • written 6.6 years ago by Philipp Bayer6.5k
2

You are missing a closing bracket, a "n" in the last line (there's also an unnecessary comma), and use strict; use warnings; which will tell you such things :)

ADD REPLYlink modified 6.6 years ago • written 6.6 years ago by SES8.2k

:) Thank you! I usually never use Perl.

ADD REPLYlink written 6.6 years ago by Philipp Bayer6.5k
2
gravatar for etienne.rifa
3.1 years ago by
etienne.rifa20
etienne.rifa20 wrote:

You can easily do this with prinseq-lite:

FILTER OPTIONS

-min_len <integer>
        Filter sequence shorter than min_len.

-max_len <integer>
        Filter sequence longer than max_len.

prinseq-lite.pl -fastq yourfile.fastq -out_format 4 -out_good seqs_good -min_len 21 -trim_to_len 25

http://prinseq.sourceforge.net/manual.html

ADD COMMENTlink written 3.1 years ago by etienne.rifa20
1
gravatar for komjinhubuio
1 day ago by
komjinhubuio10
komjinhubuio10 wrote:

I would like to introduce you to a powerful software: seqkit (https://bioinf.shenwei.me/seqkit/usage/), with which you can easily manipulate fastq/fasta format seq. seqkit seq youseq.fastq -m 21 -M 25 > result.fq

ADD COMMENTlink written 1 day ago by komjinhubuio10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1258 users visited in the last hour