Question: Identify FASTQ files containing multiple DNA sequences based on specific sequence IDs
6 weeks ago by
tahir0
tahir0 wrote:

I have 500 files, each containing 4000 DNA sequences in FASTQ format. I also have 20 sequence names (IDs) extracted from BAM files, based on alignments with the DNA sequences in the 500 FASTQ files. I want to identify which of the 500 files contain the sequences corresponding to these 20 IDs. Ultimately, I want to remove those sequences from the files and save the files again. Please guide/help.

file filtering sequence fastq • 155 views
modified 6 weeks ago • written 6 weeks ago by tahir0

OK, thanks again. I tested it and it worked for one FASTQ file with a list of IDs. I do not know how to make it scan all 500 files! I used a wildcard (*.fastq) for in and out, but it gave an error. Unless this command can be used in a script with a loop. Ideas?

written 6 weeks ago by tahir0

I wrote this loop and it worked. Thanks for the tips/help.

mkdir -p removed
for i in *.fastq; do filterbyname.sh in="$i" out="removed/$i" names=removelist.txt include=f; done
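
The loop filters every file, whether or not it holds a listed read. To see first which of the files contain any of the IDs, something along these lines may help. It is only a sketch under a few assumptions: records are plain four-line FASTQ (no wrapped sequence lines), removelist.txt has one ID per line without the leading @, and an ID is the header text up to the first whitespace. The function name files_with_ids is made up for illustration and is not part of BBMap.

```shell
# files_with_ids IDS_FILE FASTQ...
# Print the name of each FASTQ file that holds at least one listed read ID.
# Assumes 4-line records and IDs listed without the leading "@".
files_with_ids() {
    ids_file=$1
    shift
    for fq in "$@"; do
        if awk '
            NR == FNR { if (NF) wanted[$1] = 1; next }  # first file: load the ID list
            FNR % 4 == 1 {                              # header line of each record
                id = substr($1, 2)                      # drop the leading "@"
                if (id in wanted) { found = 1; exit }   # stop at the first hit
            }
            END { exit (found ? 0 : 1) }
        ' "$ids_file" "$fq"; then
            printf '%s\n' "$fq"
        fi
    done
}
```

Called as files_with_ids removelist.txt *.fastq, it prints one matching file name per line, so the filtering loop can be limited to those files if desired.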
modified 6 weeks ago by _r_am32k • written 6 weeks ago by tahir0

This is not a novel answer; it is a loop that uses an existing answer. You should add it as a comment to the answer, not as a new answer.

written 6 weeks ago by _r_am32k
6 weeks ago by
GenoMax95k
United States
GenoMax95k wrote:

If you have the sequence names/IDs, this is easy to do using filterbyname.sh from the BBMap suite. Take a look at the full in-line help (portion pasted below). names= can also be a file with multiple read IDs (one per line).

Usage:  filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.
Important!  Leading > and @ symbols are NOT part of sequence names;  they are part of
the fasta, fastq, and sam specifications.  Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f       Set to 'true' to include the filtered names rather than excluding them.
modified 6 weeks ago • written 6 weeks ago by GenoMax95k

I work with nanopore data, so the reads are not paired. I will try it and see. Thanks. I did try seqkit with seqkit grep -n -f removelist.txt -v test.fastq -o test.clean.fastq. It does not work for me: there is no error, but it saves the file as is, with no reads filtered out.

written 6 weeks ago by tahir0

Then just use in= and out=. Make sure the names are stripped of the leading @. You can also do partial matches (I know nanopore headers are super long at times).
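
If the ID list was copied straight out of FASTQ headers, stripping the leading symbol could be sketched as below. This is an illustration only: the function name strip_header_symbol and the file name raw_ids.txt are hypothetical, and it assumes one ID per line.

```shell
# strip_header_symbol: read IDs on stdin and write them to stdout without a
# leading "@" or ">" (those symbols belong to the file format, not the name).
strip_header_symbol() {
    sed -e 's/^[@>]//'
}

# Hypothetical usage: clean the list, then pass the result to names=
# strip_header_symbol < raw_ids.txt > removelist.txt
```

Only the first character is touched, so an @ or > that appears later in a name is left alone.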

modified 6 weeks ago • written 6 weeks ago by GenoMax95k