Identify FASTQ files containing multiple DNA sequences based on specific sequence IDs
3.4 years ago
tahir • 0

I have 500 files, each containing 4,000 DNA sequences in FASTQ format. I also have 20 sequence names (IDs) extracted from BAM files, based on alignments of the sequences in those 500 FASTQ files. I want to identify which of the 500 files contain the sequences corresponding to the 20 IDs. Ultimately, I want to remove those sequences from the files and save the files again. Any guidance would be appreciated.

sequence fastq file filtering • 883 views

OK, thanks again. I tested it and it worked for one FASTQ file with the list of IDs. I do not know how to make it scan all 500 files, though. I used a wildcard (*.fastq) for in= and out=, but it gave an error. Perhaps this command can be used in a script with a loop. Any ideas?


I wrote this loop and it worked. Thanks for the tips and help.

for i in *.fastq; do filterbyname.sh in="$i" out="removed/$i" names=removelist.txt include=f; done
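
The loop above rewrites every file, but it does not say which of the 500 files actually held any of the 20 reads. A minimal sketch for that first part of the question, using only awk, might look like the following. The function name files_with_ids is made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequences) and that each ID in the list is the first whitespace-delimited token of the header, without the leading @, which is how read names usually appear in a BAM file.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of each FASTQ file that contains at
# least one read whose ID appears in the ID list.
# Assumptions: 4-line FASTQ records; one ID per line in the list, no leading '@'.
files_with_ids() {
    local ids=$1
    shift
    local f
    for f in "$@"; do
        awk -v file="$f" '
            # First input (the ID list): remember every non-empty ID.
            NR == FNR { if ($1 != "") wanted[$1] = 1; next }
            # Second input (the FASTQ): only look at header lines (1st of every 4).
            FNR % 4 == 1 {
                id = substr($1, 2)          # drop the "@", keep text up to first whitespace
                if (id in wanted) { print file; exit }
            }
        ' "$ids" "$f"
    done
}
```

It could be called as files_with_ids removelist.txt *.fastq, which prints one matching file name per line. Checking only every fourth line avoids false hits on quality strings that happen to start with @.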

This is not a novel answer, it is a loop that uses an existing answer. You should add it as a comment to the answer, not as a new answer.

3.4 years ago
GenoMax 141k

If you have the sequence names/IDs, this is easy to do using filterbyname.sh from the BBMap suite. Take a look at the full in-line help (a portion is pasted below). names= can also be a file with multiple read IDs (one per line).

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.

I work with nanopore data, so the reads are not paired. I will try it and see. Thanks. I did try seqkit with the command below, but it does not work for me: there is no error, but the output file is saved unchanged, with no reads filtered out.

seqkit grep -n -f removelist.txt -v test.fastq -o test.clean.fastq


Then just use in= and out=. Make sure the names are stripped of the leading @. You can also do partial matches (I know nanopore headers are super long at times).
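
If the ID list was copied from FASTQ or FASTA headers, the leading symbols can be stripped before passing the file to names=. The following is only a sketch: clean_ids is a hypothetical helper name, and it does nothing beyond removing a leading @ or >, trimming trailing whitespace, dropping blank lines, and removing duplicates.

```shell
# Hypothetical helper: normalize a read-ID list, one ID per line.
clean_ids() {
    # 1) drop a leading '@' or '>'; 2) trim trailing whitespace; 3) delete empty lines
    sed -e 's/^[@>]//' -e 's/[[:space:]]*$//' -e '/^$/d' "$1" | sort -u
}
```

For example, clean_ids rawlist.txt > removelist.txt (rawlist.txt being a placeholder name) would produce a list suitable for names=removelist.txt.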
