Question

Subsetting fastq file based on reads starting with a specific pattern

0

Entering edit mode

2.8 years ago

biomon ▴ 60

Hi,

I would like to subset a fastq file and only extract reads that start with a specific pattern/ nucleotide sequence into new fastq file. Is there a good way I can do this?

Thanks in advance!

fastq subset • 1.2k views

ADD COMMENT • link 2.8 years ago by biomon ▴ 60

0

Entering edit mode

seqkit locate is probably what you're looking for.

ADD REPLY • link 2.8 years ago by Dunois ★ 2.5k

0

Entering edit mode

Thanks, I'll give it a go and play around with it. I had a look at the manual, I can only see examples with fasta input, no fastq. The output examples were also not in fastq. Maybe I have missed something.

ADD REPLY • link 2.8 years ago by biomon ▴ 60

1

Entering edit mode

Hmm fair point. I don't see anything about fastq files either. That's really unfortunate then.

Maybe something like this then:

import re
import sys
from Bio import SeqIO
from os import path


inpfile = sys.argv[1]
outfile = path.dirname(path.realpath(inpfile)) + '/' + path.basename(inpfile).replace('.fastq', '_filt.fastq')

mypat = sys.argv[2]
myregex = re.compile(mypat)


with open(inpfile, 'r') as fqfile:
    with open(outfile, 'w') as outfq:
        for record in SeqIO.parse(fqfile, 'fastq'):
            if(re.match(myregex, str(record.seq))):
                print("Matched:\n", record.id, "\n", record.seq)
                SeqIO.write(record, outfq, 'fastq')

print("Done!!\n")

Put that in a file (e.g., filename.py), invoke it with python3 like so:

python3 filename.py input.fastq "^ATGC"

It should create an input_filt.fastq file in the same directory as input.fastq that contains all sequences matching the pattern you supplied to it ("^ATGC"in this example).

(You'll need to have Biopython installed for this little script to work.)

ADD REPLY • link 2.8 years ago by Dunois ★ 2.5k

1

Entering edit mode

Great, thanks I will have a go!

ADD REPLY • link 2.8 years ago by biomon ▴ 60