Question: Identify FASTQ files containing multiple DNA sequences based on specific sequence IDs
6 weeks ago by
tahir0 wrote:

I have 500 files, each containing 4000 DNA sequences in FASTQ format. I also have 20 sequence names (IDs), extracted from BAM files based on alignments of the DNA sequences in those 500 FASTQ files. I want to identify which of the 500 files contain the sequences corresponding to these 20 IDs. Ultimately, I want to remove those sequences from the files and re-save the files. Please guide/help.

modified 6 weeks ago • written 6 weeks ago by tahir0

OK, thanks again. I tested it and it worked for one FASTQ file with the list of IDs. I do not know how to make it scan all 500 files, though! I used a wildcard (*.fastq) for in and out, but it gave an error. Unless this command can be used in a script with a loop. Ideas?

written 6 weeks ago by tahir0

I wrote this loop and it worked. Thanks for the tips/help.

for i in *.fastq; do filterbyname.sh in="$i" out="removed/$i" names=removelist.txt include=f; done
modified 6 weeks ago by _r_am32k • written 6 weeks ago by tahir0

This is not a novel answer; it is a loop that uses an existing answer. You should add it as a comment to the answer, not as a new answer.

written 6 weeks ago by _r_am32k
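
For reference, the loop in the comment above can be written a little more defensively. This is only a sketch: it assumes the BBMap tool in question is filterbyname.sh and that it is on the PATH, and the function name filter_all is made up for illustration. It quotes file names, creates the output directory, and skips the case where no *.fastq file exists.

```shell
# Sketch: run BBMap's filterbyname.sh over every FASTQ file in the
# current directory, writing filtered copies into a separate directory.
# filter_all is a hypothetical helper name, not part of BBMap.
filter_all() {
    names_file=$1   # one read ID per line, without the leading @
    out_dir=$2      # created if it does not exist yet
    mkdir -p "$out_dir" || return 1
    for fq in *.fastq; do
        [ -e "$fq" ] || continue   # skip the literal pattern when nothing matches
        # include=f drops the listed reads and keeps everything else
        filterbyname.sh in="$fq" out="$out_dir/$fq" names="$names_file" include=f
    done
}
```

It would be called as: filter_all removelist.txt removed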
6 weeks ago by
United States
GenoMax95k wrote:

If you have the sequence names/IDs, then this is easy to do using filterbyname.sh from the BBMap suite. Take a look at the full in-line help (portion pasted below). names= can also be a file with multiple read IDs (one per line).

Usage: filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
And these are incorrect:

include=f       Set to 'true' to include the filtered names rather than excluding them.
modified 6 weeks ago • written 6 weeks ago by GenoMax95k

I work with nanopore data, so it is not paired. I will try it and see. Thanks. I did try seqkit with

seqkit grep -n -f removelist.txt -v test.fastq -o test.clean.fastq

but it does not work for me. There is no error, but it saves the file as is, with no reads filtered out.

written 6 weeks ago by tahir0

Then just use in= and out=. Make sure the names are stripped of the leading @. You can also do partial matches (I know nanopore headers are super long at times).

modified 6 weeks ago • written 6 weeks ago by GenoMax95k
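
The first part of the question, finding which of the files contain any of the listed reads, can be approximated with standard tools before any filtering is done. The following is a rough sketch rather than a BBMap feature. It assumes plain four-line FASTQ records (no wrapped sequence lines) and that the read ID is the first whitespace-delimited token of each header line. The names clean_ids and files_with_ids are hypothetical helpers.

```shell
# Strip a leading @ from each ID and drop blank lines, so the list
# holds bare read IDs, one per line.
clean_ids() {
    sed -e 's/^@//' -e '/^[[:space:]]*$/d' "$1"
}

# Print the name of every FASTQ file that contains at least one listed read.
# Assumes 4-line records: header lines are lines 1, 5, 9, ... of each file.
files_with_ids() {
    ids=$1
    shift
    for fq in "$@"; do
        awk 'NR == FNR { want[$1] = 1; next }          # first file: load the ID list
             FNR % 4 == 1 {                             # header line of a record
                 id = substr($1, 2)                     # drop the leading @
                 if (id in want) { found = 1; exit }
             }
             END { exit !found }' "$ids" "$fq" && printf '%s\n' "$fq"
    done
}
```

A possible way to use it: run clean_ids removelist.txt > removelist.clean.txt, then files_with_ids removelist.clean.txt *.fastq to list only the files that need filtering.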