Question

Match header (list) to sequence in fasta file quickly?

0

Entering edit mode

18 months ago

Eliveri ▴ 350

Hi. I have a list of headers for which I want to find the sequence in a large fastq file to write a new fastq file.

List of headers list.txt

@A00111:399:HWLKHDSXX:2:1101:1000:7434
@A00111:399:HWLKHDSXX:2:1101:1000:...
@A00111:399:HWLKHDSXX:2:1101:1000:...

fastq file largefile.fastq

@A00111:399:HWLKHDSXX:2:1101:10004:332111
AACAAGTGATAATCAGAGAGTTCTCACAGGTTCTCACTGATAATGATAAAGGTTCTCACAG...
@A00111:399:HW...

I am currently using the following command, but it takes a really long time.

while read HEADER
do
    grep -A 1 -m 1 $HEADER largefile.fastq >> new.fastq
done < headers list.txt

Is there a faster way of doing this?

fasta grep • 744 views

ADD COMMENT • link 18 months ago by Eliveri ▴ 350

0

Entering edit mode

Thank you all for the help! In the end I chose to write a short python script rather than use bash as the reference fasta is not so large that it cannot be read into memory.

my_dict = {}

with open('reference.fasta', 'r') as file:    
    for line in file:
        line2 = next(file, None)
        my_dict[line] = line2

my_list = open("headers_list.txt","r")
new_fasta = open("new_file.fasta","a")

for item in my_list:
    if item in my_dict: 
        new_fasta.write(item + my_dict[item])

ADD REPLY • link 18 months ago by Eliveri ▴ 350

1

Entering edit mode

18 months ago

GenoMax 142k

Use filterbyname.sh from BBMap suite. This would be a better solution than one @Pierre mentions above, if you have paired-end data files.

Description:  Filters reads by name.

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

ADD COMMENT • link 18 months ago by GenoMax 142k

score 2 · Accepted Answer · 2022-11-07

2

Entering edit mode

18 months ago

Pierre Lindenbaum 161k

LC_ALL=C grep -F -f list.txt --no-group-separator  -A1 largefile.fastq  > new.fastq

ADD COMMENT • link 18 months ago by Pierre Lindenbaum 161k