Question

How to remove a list of reads from fastq file?

3

Entering edit mode

7.8 years ago

Tao ▴ 530

Dear all,

I want to filter/remove some reads from a gzipped fastq file, according to a list of reads ID. How should I do?

I have one idea here. Say we have in.fastq.gz, reads-remove-list:

First, output the reads' ID of fastq file:

zcat in.fastq.gz | awk 'NR%4==1' | sed 's/@//' > in.fastq.readsID

Second, filter reads in reads-remove-list and get the remaining reads' ID:

grep -f reads-remove-list in.fastq.readsID -v > remaining.list

Third, use seqtk subseq extract the remaining reads:

seqtk subseq in.fastq.gz remaining.list | gzip - > remaining.fastq.gz

Do you guys have more direct/efficient solutions? just use some shell commands or other scripts?

Thanks! Tao

fastq shell bash script • 13k views

ADD COMMENT • link updated 17 months ago by Viki • 0 • written 7.8 years ago by Tao ▴ 530

score 5 · Answer 1 · 2016-07-04

5

Entering edit mode

7.8 years ago

Brian Bushnell 20k

I suggest you try BBMap's filterbyname tool. It will handle names in many different formats, and supports inclusion or exclusion filtering. So, this would just be a single command.

ADD COMMENT • link 7.8 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks for your suggestion!

ADD REPLY • link 7.8 years ago by Tao ▴ 530

score 1 · Answer 2 · 2016-07-05

1

Entering edit mode

7.8 years ago

gangireddy ▴ 160

try

qiime filter_fasta.py -f in.fastq -o out.fastq -s list_of_sequences.txt -n

@this link

ADD COMMENT • link 7.8 years ago by gangireddy ▴ 160

0

Entering edit mode

Thanks! that's great choice!

ADD REPLY • link 7.8 years ago by Tao ▴ 530

score 0 · Answer 3 · 2016-07-04

0

Entering edit mode

7.8 years ago

Tao ▴ 530

I wrote a python script to filter reads from gzipped fastq file and output as fasta.

Usage:

zcat in.fastq.gz | python filter.py reads-remove-list | gzip - > remaining.fastq.gz

filter.py

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line

ADD COMMENT • link 7.8 years ago by Tao ▴ 530

0

Entering edit mode

How would you extend it to get also the quality scores?

ADD REPLY • link 6.9 years ago by Ric ▴ 430

1

Entering edit mode

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)
remove_list = set(remove_list)

switch = 1;
line_count = 0
for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            line = line[1:]
        else:
            print "error read Id", line
            sys.exit(1)
        if line in remove_list:
            switch = 0
        else:
            switch = 1
            print line
    if line_count%4 == 2 and switch == 1:
        print line
    if line_count%4 == 3 and switch == 1: # include + / - strand 
        print line
    if line_count%4 == 4 and switch == 1: # include quality score
        print line

ADD REPLY • link 6.9 years ago by Tao ▴ 530

0

Entering edit mode

Hello, I'm trying this but (i) the quality score information isn't be returned and (ii) the '@' from the read name line is missing. Looking at the script, I can't figure out why this would be -- think you can help update it again?

ADD REPLY • link 3.9 years ago by Jautis ▴ 530

0

Entering edit mode

Hi, I know this was a few years ago now but this is my fix:

import fileinput
import sys
import os
reads_to_remove_path = sys.argv[1]

fp = open(reads_to_remove_path)
remove_list = []
while True:
    line = fp.readline()
    line = line.strip()
    if not line:
        break
    remove_list.append(line)

remove_list = set(remove_list)

switch = 1;
line_count = 0

for line_in in sys.stdin:

    line_count += 1
    line = line_in.strip()
    if line_count%4 == 1:
        if line.startswith('@'):
            if line in remove_list:
                switch = 0
            else:
                switch = 1
        else:
            print ("error read Id", line)
            sys.exit(1)

    if switch == 1:
        print (line)

ADD REPLY • link 17 months ago by Viki • 0