Search for a read by its name in a big fastq.gz file
4 months ago
xiaoleiusc ▴ 140

Dear All,

How can I search for a read by part of its name in a big fastq.gz file (around 13 GB)? For example, I would like to find the read whose name contains the string VH01677:31:AACCMFHHV:1:1101:6586:25290. For a small fastq.gz file, I just decompress it with gunzip in the macOS terminal, open it in a text editor, and search for the read with Ctrl + F. For a big fastq.gz file, I do not want to do it this way because it is very inefficient.

Thanks,
Xiao

NGS fastq • 788 views
4 months ago

Plain grep should work.

gzip -dc input.fastq.gz | grep -A3 'VH01677:31:AACCMFHHV:1:1101:6586:25290' | gzip > match.fastq.gz

You could try seqkit grep also if you want to use a more formal fastq parser.

seqkit grep -rp 'VH01677:31:AACCMFHHV:1:1101:6586:25290' -o match.fastq.gz input.fastq.gz
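
If the read name is unique, there is no need to scan the rest of a 13 GB file once the read has been found. Below is a minimal sketch using only gzip and awk that checks header lines only and stops after the first hit. It assumes standard 4-line FASTQ records (no wrapped sequence lines); find_read is a hypothetical helper name, not part of any tool mentioned here.

```shell
# find_read PATTERN FILE.fastq.gz
# Hypothetical helper: print the first FASTQ record whose header line
# contains PATTERN as a fixed string (not a regex), then stop reading.
# Assumes 4-line records and a pattern without backslashes.
find_read() {
    pattern="$1"
    fastq_gz="$2"
    gzip -dc "$fastq_gz" | awk -v pat="$pattern" '
        # Only lines 1, 5, 9, ... are headers; test those for the substring.
        NR % 4 == 1 && index($0, pat) { hit = 1 }
        # Once a header matched, print its 4 lines and exit early.
        hit { print; if (++n == 4) exit }
    '
}
```

Calling find_read 'VH01677:31:AACCMFHHV:1:1101:6586:25290' input.fastq.gz prints the record on the screen. Redirect the output to a file to keep it.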

Hi, rpolicastro,

Thank you very much! I tried seqkit grep and it works! It seems to me you changed the arguments from -Irp to just -rp.

Xiao


I'm glad it worked! Whether or not you include -I as an argument you'll get the same results, so I decided to edit my post and remove it just to simplify the answer.


I want to print the matched read on the screen, so I use the -Irp argument without the -o argument, and it works. Thanks again!

Xiao

4 months ago
GenoMax 141k

You can also use filterbyname.sh from the BBMap suite.

$ filterbyname.sh -Xmx2g in=file_R1_001.fastq.gz names=MXXXXX:469:000000000-XXXX7:1:1101:16837:2353 out=stdout.fq include=t

names= A list of strings or files.  The files can have one name per line, or be a standard read file 

You can also do substring matches, case-sensitive matches, etc. Note that the read name being searched for should not include the leading @.
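
If the names were copied straight from FASTQ header lines, they will still carry the leading @. A minimal sketch for cleaning such a list before passing it as names= is shown here. strip_at is a hypothetical helper name, and the input is assumed to hold one name per line.

```shell
# strip_at FILE
# Hypothetical helper: remove a single leading "@" from every line of FILE
# and write the result to standard output. Other lines pass through as-is.
strip_at() {
    sed 's/^@//' "$1"
}
```

For example, strip_at headers.txt > names.txt writes a cleaned list that can then be given as names=names.txt (both file names are placeholders).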

4 months ago
Ram 43k

Another option is seqtk subseq: https://github.com/lh3/seqtk

With a list of names in a file named name.lst, you can use the following command to extract the corresponding read(s):

seqtk subseq in.fq name.lst > out.fq
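
When only part of the read name is known, name.lst has to be built first. Here is a minimal sketch with gzip, awk, grep and sed, assuming 4-line FASTQ records. It also assumes that seqtk matches on the read name up to the first whitespace, without the leading @, which is my understanding of its behaviour rather than something stated in this thread. make_name_list is a hypothetical helper name.

```shell
# make_name_list PATTERN FILE.fastq.gz
# Hypothetical helper: list the names of all reads whose header line
# contains PATTERN as a fixed string, one name per line.
# Steps: keep header lines only (every 4th line starting at line 1),
# filter them with a fixed-string grep, then drop the leading "@" and
# anything after the first whitespace (the header comment).
make_name_list() {
    pattern="$1"
    fastq_gz="$2"
    gzip -dc "$fastq_gz" |
        awk 'NR % 4 == 1' |
        grep -F -- "$pattern" |
        sed -e 's/^@//' -e 's/[[:space:]].*$//'
}
```

For example, make_name_list 'VH01677:31:AACCMFHHV:1:1101:6586:25290' in.fq.gz > name.lst produces the list used in the command above.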