Question

Identify FASTQ files containing multiple DNA sequences based on specific sequence IDs

3.4 years ago

tahir • 0

I have 500 files, each containing 4000 DNA sequences in FASTQ format. I also have 20 sequence names (IDs) extracted from BAM files, based on alignments of the sequences in those 500 FASTQ files. I want to identify which of the 500 files contain the sequences corresponding to these 20 IDs. Ultimately, I want to remove those sequences from the files and re-save the files. Any guidance would be appreciated.

sequence fastq file filtering • 883 views

OK, thanks again. I tested it and it worked for one FASTQ file with the list of IDs. I do not know how to make it scan all 500 files, though. I used a wildcard (*.fastq) for in= and out=, but that gave an error. Perhaps this command can be used in a script with a loop. Any ideas?

ADD REPLY • link 3.4 years ago by tahir • 0

I wrote this loop and it worked. Thanks for the tips/help.

for i in *.fastq; do filterbyname.sh in="$i" out="removed/$i" names=removelist.txt include=f; done

ADD REPLY • link updated 3.4 years ago by Ram 43k • written 3.4 years ago by tahir • 0
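
The loop above rewrites every file, but the original question also asked which of the 500 files actually contain the listed IDs. A minimal sketch of that first step follows. It is not part of filterbyname.sh: the function name `files_with_ids` is hypothetical, and the sketch assumes plain 4-line FASTQ records and an ID list with one ID per line, without the leading @, where each ID equals the first whitespace-delimited token of the read header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of every FASTQ file that contains
# at least one read whose ID appears in the ID list.
# Assumptions: 4-line FASTQ records; IDs listed one per line without '@';
# an ID is the first whitespace-delimited token of the header line.
files_with_ids() {
    local idlist=$1
    shift
    local fq
    for fq in "$@"; do
        awk '
            # First input file: load the IDs into a lookup table.
            NR == FNR { if ($1 != "") ids[$1] = 1; next }
            # FASTQ file: inspect header lines only (1st line of each record).
            FNR % 4 == 1 {
                id = substr($1, 2)          # drop the leading "@"
                if (id in ids) { found = 1; exit }
            }
            # Exit status 0 only if a listed ID was seen.
            END { exit !found }
        ' "$idlist" "$fq" && printf '%s\n' "$fq"
    done
}

# Example call (file names as used in this thread):
#   files_with_ids removelist.txt *.fastq
```

Checking only header lines avoids false hits on quality strings, which can also begin with @. Files that are not reported here do not need to be rewritten at all.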

This is not a novel answer; it is a loop that uses an existing answer. You should add it as a comment on that answer, not as a new answer.

ADD REPLY • link 3.4 years ago by Ram 43k

score 1 · Accepted Answer · 2020-12-15

3.4 years ago

GenoMax 141k

If you have the sequence names/IDs, this is easy to do using filterbyname.sh from the BBMap suite. Take a look at the full in-line help (a portion is pasted below). names= can also be a file with multiple read IDs (one per line).

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.

I work with nanopore data, so the reads are not paired. I will try it and see. Thanks. I did try seqkit with

seqkit grep -n -f removelist.txt -v test.fastq -o test.clean.fastq

It does not work for me. There is no error, but the output file is saved unchanged, with no reads filtered out.

ADD REPLY • link 3.4 years ago by tahir • 0

Then just use in= and out=. Make sure the names are stripped of the leading @. You can also do partial matches (I know nanopore headers are super long at times).

ADD REPLY • link 3.4 years ago by GenoMax 141k
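
The advice about stripping the leading @ can be applied to the whole ID list before filtering. The following is only a sketch: the function name `strip_header_symbol` and the input file name `ids_raw.txt` are hypothetical, and it assumes one ID per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove a leading '@' or '>' from each ID and drop
# empty lines, so the list matches what names= expects.
# Reads the given files, or standard input when no file is given.
strip_header_symbol() {
    sed -e 's/^[@>]//' -e '/^$/d' "$@"
}

# Example call (ids_raw.txt is a placeholder name):
#   strip_header_symbol ids_raw.txt > removelist.txt
```

Only the first character is touched, so an @ or > that appears later in an ID is left as it is.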