How to filter sequencing reads in a fastq file by size?

How to filter sequencing reads in a fastq file by size?

0

Entering edit mode

5.0 years ago

Beginner ▴ 80

How to filter sequencing reads in a fastq file by size?

I want to filter a Ribo-Seq fastq file to only maintain sequences between 26 and 32 bp. The approaches I tried did all mess up the fastq line arrangement of the output fastq file. Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

Thank you for your help

Ribo-Seq RNA-Seq fastq • 6.9k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 5.0 years ago by Beginner ▴ 80

0

Entering edit mode

Also an old post that I found from 6 years ago did not help since the suggested options did not keep the fastq line arrangement.

which post ?

ADD REPLY • link 5.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Filtering Fastq Sequences Based On Lengths

This one. Didn't try the biopieces option though due to issued I had with one requuired library.

ADD REPLY • link 5.0 years ago by Beginner ▴ 80

2

Entering edit mode

5.0 years ago

emeline.a.favreau ▴ 30

I would use a combination of tools: seqtk, awk and cut. Each tool can be piped, so that the outcome of one tool is used by the second tool.

Code:

# Step 1: make a list of sequence names that are within the length range
seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list

# Step 2: subset the fastq file for those sequences only
seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq

Explanations:

seq comp my.fastq

will print a summary for each sequence found in the fastq file. The first column is the name of the sequence, the second column is the length of the sequence. See an example below:

seqtk comp my.fastq | head
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:16691:3004    56  44  14  22  10  0   0   0   3   0

awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }'

will print only the rows in which the second column (here coded $2) has values higher or equal to 26 ($2 >= 26) and lower or equal to 32 ($2 <= 32). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696    27  52  14  21  63  0   0   0   4   0
ABCD-0123:987:GW1805231090:1:1101:20740:1766    27  46  19  21  64  0   0   0   10  0
ABCD-0123:987:GW1805231090:1:1101:15595:1784    27  51  27  34  38  0   0   0   12  0

cut --fields 1

will print the names of those sequences. They are found in the first column (here coded --fields 1). See an example below:

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 | head  
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To keep the names of the those sequences, the single greater-than (>) sign at the end of the command redirects the current output (here, the names of the sequences) into a new file (here, called selected-sequences-names.list)

seqtk comp my.fastq | awk '{ if (($2 >= 26) && ($2 <= 32)) { print} }' | cut --fields 1 > selected-sequences-names.list
head selected-sequences-names.list
ABCD-0123:987:GW1805231090:1:1101:13250:1696
ABCD-0123:987:GW1805231090:1:1101:20740:1766
ABCD-0123:987:GW1805231090:1:1101:15595:1784

To actually subset the original fastq file for only those sequences, one command from seqtk is needed, and the new file (subsetted.fq) contains just what you wanted.

seqtk subseq my.fastq selected-sequences-names.list > subsetted.fq

ADD COMMENT • link 5.0 years ago by emeline.a.favreau ▴ 30

0

Entering edit mode

WOW! This was incredible helpful! Thank you for this great explanation!

ADD REPLY • link 5.0 years ago by Beginner ▴ 80

0

Entering edit mode

I tried it and it worked. Thank you for your help!

ADD REPLY • link 5.0 years ago by Beginner ▴ 80

1

Entering edit mode

5.0 years ago

h.mon 35k

reformat.sh from the BBTools package, see this post: filtering the reads based on the length

ADD COMMENT • link 5.0 years ago by h.mon 35k

Login before adding your answer.

Similar Posts

Loading Similar Posts

Traffic: 2577 users visited in the last hour

Content Search
Users
Tags
Badges

Help About
FAQ

Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the

version 2.3.6