Match header (list) to sequence in fasta file quickly?
2
0
Entering edit mode
18 months ago
Eliveri ▴ 350

Hi. I have a list of headers for which I want to find the sequence in a large fastq file to write a new fastq file.

List of headers list.txt

@A00111:399:HWLKHDSXX:2:1101:1000:7434
@A00111:399:HWLKHDSXX:2:1101:1000:...
@A00111:399:HWLKHDSXX:2:1101:1000:...

fastq file largefile.fastq

@A00111:399:HWLKHDSXX:2:1101:10004:332111
AACAAGTGATAATCAGAGAGTTCTCACAGGTTCTCACTGATAATGATAAAGGTTCTCACAG...
@A00111:399:HW...

I am currently using the following command, but it takes a really long time.

while read HEADER
do
    grep -A 1 -m 1 $HEADER largefile.fastq >> new.fastq
done < headers list.txt

Is there a faster way of doing this?

fasta grep • 744 views
ADD COMMENT
0
Entering edit mode

Thank you all for the help! In the end I chose to write a short python script rather than use bash as the reference fasta is not so large that it cannot be read into memory.

my_dict = {}

with open('reference.fasta', 'r') as file:    
    for line in file:
        line2 = next(file, None)
        my_dict[line] = line2

my_list = open("headers_list.txt","r")
new_fasta = open("new_file.fasta","a")

for item in my_list:
    if item in my_dict: 
        new_fasta.write(item + my_dict[item])
ADD REPLY
2
Entering edit mode
18 months ago
LC_ALL=C grep -F -f list.txt --no-group-separator  -A1 largefile.fastq  > new.fastq
ADD COMMENT
1
Entering edit mode
18 months ago
GenoMax 142k

Use filterbyname.sh from BBMap suite. This would be a better solution than one @Pierre mentions above, if you have paired-end data files.

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12
ADD COMMENT

Login before adding your answer.

Traffic: 1726 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6