Question: How to remove a list of reads from fastq file?
0
gravatar for Tao
17 months ago by
Tao170
Tao170 wrote:

Dear all,

I want to filter/remove some reads from a gzipped fastq file, according to a list of reads ID. How should I do?

I have one idea here. Say we have in.fastq.gz, reads-remove-list:

First, output the reads' ID of fastq file:

zcat in.fastq.gz | awk 'NR%4==1' | sed 's/@//' > in.fastq.readsID

Second, filter reads in reads-remove-list and get the remaining reads' ID:

grep -f reads-remove-list in.fastq.readsID -v > remaining.list

Third, use seqtk subseq extract the remaining reads:

seqtk subseq in.fastq.gz remaining.list | gzip - > remaining.fastq.gz

Do you guys have more direct/efficient solutions? just use some shell commands or other scripts?

Thanks! Tao

bash shell script fastq • 2.0k views
ADD COMMENTlink modified 17 months ago by gangireddy140 • written 17 months ago by Tao170
3
gravatar for Brian Bushnell
17 months ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

I suggest you try BBMap's filterbyname tool. It will handle names in many different formats, and supports inclusion or exclusion filtering. So, this would just be a single command.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Brian Bushnell15k

Thanks for your suggestion!

ADD REPLYlink written 17 months ago by Tao170
1
gravatar for gangireddy
17 months ago by
gangireddy140
gangireddy140 wrote:

try

qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n

@this link

ADD COMMENTlink modified 17 months ago • written 17 months ago by gangireddy140

Thanks! that's great choice!

ADD REPLYlink written 17 months ago by Tao170
0
gravatar for Tao
17 months ago by
Tao170
Tao170 wrote:

I wrote a python script to filter reads from gzipped fastq file and output as fasta.

Usage:

zcat in.fastq.gz | python filter.py reads-remove-list | gzip - > remaining.fastq.gz

filter.py

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
ADD COMMENTlink written 17 months ago by Tao170

How would you extend it to get also the quality scores?

ADD REPLYlink modified 6 months ago • written 6 months ago by Ric170
import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
    if line_count%4 == 3 and switch == 1: # include + / - strand 
        print line
    if line_count%4 == 4 and switch == 1: # include quality score
        print line
ADD REPLYlink modified 6 months ago • written 6 months ago by Tao170
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 541 users visited in the last hour