Question: How to filter sequencing reads in a fastq file by size?
0
gravatar for Beginner
10 days ago by
Beginner0
Beginner0 wrote:

How to filter sequencing reads in a fastq file by size? I want to filter a Ribo-Seq fastq file to only maintain sequences between 26 and 32 bp. The approaches I tried did all mess up the fastq line arrangement of the output fastq file. Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

Thank you for your help.

filter rna-seq size fastq ribo-seq • 161 views
ADD COMMENTlink modified 8 days ago by emeline.a.favreau10 • written 10 days ago by Beginner0

Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

which post ?

ADD REPLYlink written 10 days ago by Pierre Lindenbaum119k

Filtering Fastq Sequences Based On Lengths

This one. Didn't try the biopieces option though due to issued I had with one requuired library.

ADD REPLYlink written 10 days ago by Beginner0
1
gravatar for h.mon
10 days ago by
h.mon24k
Brazil
h.mon24k wrote:

reformat.sh from the BBTools package, see this post: filtering the reads based on the length

ADD COMMENTlink written 10 days ago by h.mon24k
1
gravatar for emeline.a.favreau
8 days ago by
London
emeline.a.favreau10 wrote:

I would use a combination of tools: seqtk, awk and cut. Each tool can be piped, so that the outcome of one tool is used by the second tool.

Code:

# Step 1: make a list of sequence names that are within the length range
seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list

# Step 2: subset the fastq file for those sequences only
seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq

Explanations:

seq comp my.fastq

will print a summary for each sequence found in the fastq file. The first column is the name of the sequence, the second column is the length of the sequence. See an example below:

seqtk comp my.fastq | head
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:16691:3004    56  44  14  22  10  0   0   0   3   0

awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }'

will print only the rows in which the second column (here coded $2) has values higher or equal to 26 ($2 >= 26) and lower or equal to 32 ($2 <= 32). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:15595:1784    27  51  27  34  38  0   0   0   12  0

cut --fields 1

will print the names of those sequences. They are found in the first column (here coded --fields 1). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To keep the names of the those sequences, the single greater-than (>) sign at the end of the command redirects the current output (here, the names of the sequences) into a new file (here, called selected-sequences-names.list)

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list
head selected-sequences-names.list
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To actually subset the original fastq file for only those sequences, one command from seqtk is needed, and the new file (subsetted.fq) contains just what you wanted.

seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq
ADD COMMENTlink written 8 days ago by emeline.a.favreau10

WOW! This was incredible helpful! Thank you for this great explanation!

ADD REPLYlink written 6 days ago by Beginner0

I tried it and it worked. Thank you for your help!

ADD REPLYlink written 6 days ago by Beginner0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1569 users visited in the last hour