Subsetting fastq file based on reads starting with a specific pattern
0
0
Entering edit mode
4 months ago
biomon ▴ 60

Hi,

I would like to subset a fastq file and only extract reads that start with a specific pattern/ nucleotide sequence into new fastq file. Is there a good way I can do this?

Thanks in advance!

fastq subset • 342 views
ADD COMMENT
0
Entering edit mode

seqkit locate is probably what you're looking for.

ADD REPLY
0
Entering edit mode

Thanks, I'll give it a go and play around with it. I had a look at the manual, I can only see examples with fasta input, no fastq. The output examples were also not in fastq. Maybe I have missed something.

ADD REPLY
1
Entering edit mode

Hmm fair point. I don't see anything about fastq files either. That's really unfortunate then.

Maybe something like this then:

import re
import sys
from Bio import SeqIO
from os import path


inpfile = sys.argv[1]
outfile = path.dirname(path.realpath(inpfile)) + '/' + path.basename(inpfile).replace('.fastq', '_filt.fastq')

mypat = sys.argv[2]
myregex = re.compile(mypat)


with open(inpfile, 'r') as fqfile:
    with open(outfile, 'w') as outfq:
        for record in SeqIO.parse(fqfile, 'fastq'):
            if(re.match(myregex, str(record.seq))):
                print("Matched:\n", record.id, "\n", record.seq)
                SeqIO.write(record, outfq, 'fastq')

print("Done!!\n")

Put that in a file (e.g., filename.py), invoke it with python3 like so:

python3 filename.py input.fastq "^ATGC"

It should create an input_filt.fastq file in the same directory as input.fastq that contains all sequences matching the pattern you supplied to it ("^ATGC"in this example).

(You'll need to have Biopython installed for this little script to work.)

ADD REPLY
1
Entering edit mode

Great, thanks I will have a go!

ADD REPLY

Login before adding your answer.

Traffic: 1636 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6