Question: How to remove a list of reads from fastq file?
0
gravatar for Tao
23 months ago by
Tao220
Tao220 wrote:

Dear all,

I want to filter/remove some reads from a gzipped fastq file, according to a list of reads ID. How should I do?

I have one idea here. Say we have in.fastq.gz, reads-remove-list:

First, output the reads' ID of fastq file:

zcat in.fastq.gz | awk 'NR%4==1' | sed 's/@//' > in.fastq.readsID

Second, filter reads in reads-remove-list and get the remaining reads' ID:

grep -f reads-remove-list in.fastq.readsID -v > remaining.list

Third, use seqtk subseq extract the remaining reads:

seqtk subseq in.fastq.gz remaining.list | gzip - > remaining.fastq.gz

Do you guys have more direct/efficient solutions? just use some shell commands or other scripts?

Thanks! Tao

bash shell script fastq • 2.8k views
ADD COMMENTlink modified 23 months ago by gangireddy160 • written 23 months ago by Tao220
3
gravatar for Brian Bushnell
23 months ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

I suggest you try BBMap's filterbyname tool. It will handle names in many different formats, and supports inclusion or exclusion filtering. So, this would just be a single command.

ADD COMMENTlink modified 23 months ago • written 23 months ago by Brian Bushnell15k

Thanks for your suggestion!

ADD REPLYlink written 23 months ago by Tao220
1
gravatar for gangireddy
23 months ago by
gangireddy160
gangireddy160 wrote:

try

qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n

@this link

ADD COMMENTlink modified 23 months ago • written 23 months ago by gangireddy160

Thanks! that's great choice!

ADD REPLYlink written 23 months ago by Tao220
0
gravatar for Tao
23 months ago by
Tao220
Tao220 wrote:

I wrote a python script to filter reads from gzipped fastq file and output as fasta.

Usage:

zcat in.fastq.gz | python filter.py reads-remove-list | gzip - > remaining.fastq.gz

filter.py

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
ADD COMMENTlink written 23 months ago by Tao220

How would you extend it to get also the quality scores?

ADD REPLYlink modified 13 months ago • written 13 months ago by Ric180
1
import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
    if line_count%4 == 3 and switch == 1: # include + / - strand 
        print line
    if line_count%4 == 4 and switch == 1: # include quality score
        print line
ADD REPLYlink modified 13 months ago • written 13 months ago by Tao220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1368 users visited in the last hour