Question: How to remove a list of reads from fastq file?
0
gravatar for Tao
2.2 years ago by
Tao240
Tao240 wrote:

Dear all,

I want to filter/remove some reads from a gzipped fastq file, according to a list of reads ID. How should I do?

I have one idea here. Say we have in.fastq.gz, reads-remove-list:

First, output the reads' ID of fastq file:

zcat in.fastq.gz | awk 'NR%4==1' | sed 's/@//' > in.fastq.readsID

Second, filter reads in reads-remove-list and get the remaining reads' ID:

grep -f reads-remove-list in.fastq.readsID -v > remaining.list

Third, use seqtk subseq extract the remaining reads:

seqtk subseq in.fastq.gz remaining.list | gzip - > remaining.fastq.gz

Do you guys have more direct/efficient solutions? just use some shell commands or other scripts?

Thanks! Tao

bash shell script fastq • 3.3k views
ADD COMMENTlink modified 2.2 years ago by gangireddy160 • written 2.2 years ago by Tao240
3
gravatar for Brian Bushnell
2.2 years ago by
Walnut Creek, USA
Brian Bushnell15k wrote:

I suggest you try BBMap's filterbyname tool. It will handle names in many different formats, and supports inclusion or exclusion filtering. So, this would just be a single command.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Brian Bushnell15k

Thanks for your suggestion!

ADD REPLYlink written 2.2 years ago by Tao240
1
gravatar for gangireddy
2.2 years ago by
gangireddy160
gangireddy160 wrote:

try

qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n

@this link

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by gangireddy160

Thanks! that's great choice!

ADD REPLYlink written 2.2 years ago by Tao240
0
gravatar for Tao
2.2 years ago by
Tao240
Tao240 wrote:

I wrote a python script to filter reads from gzipped fastq file and output as fasta.

Usage:

zcat in.fastq.gz | python filter.py reads-remove-list | gzip - > remaining.fastq.gz

filter.py

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
ADD COMMENTlink written 2.2 years ago by Tao240

How would you extend it to get also the quality scores?

ADD REPLYlink modified 16 months ago • written 16 months ago by Ric190
1
import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
    if line_count%4 == 3 and switch == 1: # include + / - strand 
        print line
    if line_count%4 == 4 and switch == 1: # include quality score
        print line
ADD REPLYlink modified 16 months ago • written 16 months ago by Tao240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 930 users visited in the last hour