How to remove a list of reads from fastq file?
3
2
Entering edit mode
4.8 years ago
Tao ▴ 420

Dear all,

I want to filter/remove some reads from a gzipped fastq file, according to a list of reads ID. How should I do?

I have one idea here. Say we have in.fastq.gz, reads-remove-list:

First, output the reads' ID of fastq file:

zcat in.fastq.gz | awk 'NR%4==1' | sed 's/@//' > in.fastq.readsID

Second, filter reads in reads-remove-list and get the remaining reads' ID:

grep -f reads-remove-list in.fastq.readsID -v > remaining.list

Third, use seqtk subseq extract the remaining reads:

seqtk subseq in.fastq.gz remaining.list | gzip - > remaining.fastq.gz

Do you guys have more direct/efficient solutions? just use some shell commands or other scripts?

Thanks! Tao

fastq shell bash script • 7.1k views
ADD COMMENT
5
Entering edit mode
4.8 years ago

I suggest you try BBMap's filterbyname tool. It will handle names in many different formats, and supports inclusion or exclusion filtering. So, this would just be a single command.

ADD COMMENT
0
Entering edit mode

Thanks for your suggestion!

ADD REPLY
1
Entering edit mode
4.8 years ago
gangireddy ▴ 160

try

qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n

@this link

ADD COMMENT
0
Entering edit mode

Thanks! that's great choice!

ADD REPLY
0
Entering edit mode
4.8 years ago
Tao ▴ 420

I wrote a python script to filter reads from gzipped fastq file and output as fasta.

Usage:

zcat in.fastq.gz | python filter.py reads-remove-list | gzip - > remaining.fastq.gz

filter.py

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
ADD COMMENT
0
Entering edit mode

How would you extend it to get also the quality scores?

ADD REPLY
1
Entering edit mode
import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
    if line_count%4 == 3 and switch == 1: # include + / - strand 
        print line
    if line_count%4 == 4 and switch == 1: # include quality score
        print line
ADD REPLY
0
Entering edit mode

Hello, I'm trying this but (i) the quality score information isn't be returned and (ii) the '@' from the read name line is missing. Looking at the script, I can't figure out why this would be -- think you can help update it again?

ADD REPLY

Login before adding your answer.

Traffic: 2422 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6