Extracting specific sequences from FASTQ using Seqtk
1
0
Entering edit mode
5.5 years ago
omer.k ▴ 110

I have a large fastq file containing NGS sequences.

Using prior tools I've been able to narrow down the names of some 2,500 reads which are of interest to me, in OutputFile.txt.

In this file there are names of the reads ordered in the following patter:

1123    084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z
1124    5c621711-9233-40a3-a544-419fe5135d83 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2382 ch=320 start_time=2018-03-27T11:35:19Z

etc...

I'd like to use this OutputFile.txt to extract only the names of the reads and the sequences from the original fastq file into a new filtered fastq file

I have Seqtk installed on windows, running via Command Line. I believe I could use Seqtk grep to extract only those 2,500 reads. I'm having trouble though with the proper command and the flags.

Could you please advise?

SeqKit • 9.6k views
ADD COMMENT
1
Entering edit mode

I think BBMap do this job, with filterbyname.sh script

filterbyname.sh in=reads.fq out=filtered.fq names=names.txt include=t

From this thread

Edit :

Also possible with seqtk subseq :

seqtk subseq  in.fastq  list.txt  > out.fastq

From here

ADD REPLY
0
Entering edit mode

Thanks @Bastien. But I have tried it before and it gives an error:

[ERRO] fastx: invalid FASTA/Q format

Any idea why this happens?

ADD REPLY
0
Entering edit mode

Maybe it is related to nanopore data, space in the headers... I never process these kind of data. If BBmap and seqtk cannot manage those data I cannot do much more sadly.

You have the best ones on the case, see below , I will not say better :)

ADD REPLY
0
Entering edit mode

There is always Biopython

ADD REPLY
0
Entering edit mode

Is this nanopore data? I wonder if the tools are having problems with the spaces in the names of fastq headers. You could certainly try filterbyname.sh or otherwise change the spaces to _ temporarily and then use the tools mentioned by Bastien.

ADD REPLY
0
Entering edit mode

This is indeed nanopore data. Could these spaces in the header of the nanopore data be the reason for the error I'm getting (see my reply to Bastien's suggestion)?

ADD REPLY
0
Entering edit mode

While I did not spend a lot of time it appears that neither filterbyname.sh or seqtk subseq work as intended with nanopore fastq data. I have asked @Brian about BBMap.

Also checking with @Wouter deCoster to see if his nanofilt tool can be easily modified to do this task.

ADD REPLY
0
Entering edit mode

The 084b5b69-e819-426f-9ff3-fea4891af330 portion should be a unique identifier. Therefore you don't need the other parts of your txt file/read names for filtering. I'd assume filterbyname.sh should work if you only use those unique parts.

ADD REPLY
0
Entering edit mode

Thanks Wouter, I understand your tip. However, what would be a fast way to filter the file which contains all the reads of my interest (that is, the file which I'd liketo guide the filtering of the FASTQ file), and keep just the identifiers, i.e. 084b5b69-e819-426f-9ff3-fea4891af330?

ADD REPLY
1
Entering edit mode
5.5 years ago
GenoMax 141k

Simplest thing to do is put the patterns you need (from that cut command) in a file.

$ cat name.txt 
bba1f3f9-6b93-48b5-8c99-9c0d3eafa232
caa54c6e-bab0-4d7a-acd6-cad8011bdd10
125a259c-1937-47c1-9b17-5d88d833784f

then use grep to pull sequences out.

$ grep -f name.txt -A 3 your.fq > seq_you_need.fq
ADD COMMENT
0
Entering edit mode

Hi genomax,

How would you extract though only the 084b5b69-e819-426f-9ff3-fea4891af330 from a line such as

1123    084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z

?

Or even from

084b5b69-e819-426f-9ff3-fea4891af330 runid=d20680f0495125cc465d6a96efb49b194cb0777b read=2290 ch=184 start_time=2018-03-27T11:36:09Z

(I can remove rather easily just the first string which are the prefix-numbers)?

ADD REPLY
0
Entering edit mode

If records are space separated then cut -d ' ' -f2 your_file > id_you_need should pull out that second column. If records are tab separate then just cut -f2 your_file > id_you_need should work.

ADD REPLY
0
Entering edit mode

Thanks, I've read online and used exactly this command (cut) to process the lines. Now I have a file with only the unique identifiers, which will guide the filtering of the FASTQ file. Will update.

ADD REPLY
0
Entering edit mode

Filtering by grep was successful. I used $ grep -f name.txt -A 3 your.fq > seq_you_need.fq as suggested by genomax. Thanks.

ADD REPLY
0
Entering edit mode

A few options to make things faster:

  1. -F to use "fixed" patterns, i.e. - tell grep not to bother with regexes
  2. -x if name.txt to match by whole line contains full read names i.e. line 1 of target read fully matches the corresponding entry in name.txt (not relevant in this case but could be relevant if complete read names are being used)
ADD REPLY

Login before adding your answer.

Traffic: 2656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6