I have 500 files, each containing 4000 DNA sequences in FASTQ format. I also have 20 sequence names (IDs) extracted from BAM files, based on alignments of the DNA sequences in those 500 FASTQ files. I want to identify which of the 500 files contain the sequences corresponding to the 20 IDs. Ultimately, I want to remove those sequences from the files and re-save the files. Please guide/help.
Question: Identify FASTQ files containing multiple DNA sequences based on specific sequence IDs
6 weeks ago by
tahir • 0
6 weeks ago by
GenoMax ♦ 95k
If you have the sequence names/IDs, this is easy to do using filterbyname.sh from the BBMap suite. Take a look at the full in-line help (portion pasted below).
names= can also be a file with multiple read IDs (one per line).
Usage: filterbyname.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> names=<string,string,string> include=<t/f>

in2 and out2 are for paired reads and are optional.
If input is paired and there is only one output file, it will be written interleaved.

Important! Leading > and @ symbols are NOT part of sequence names; they are part of the fasta, fastq, and sam specifications. Therefore, this is correct:
names=e.coli_K12
And these are incorrect:
names=>e.coli_K12
names=@e.coli_K12

Parameters:
include=f    Set to 'true' to include the filtered names rather than excluding them.
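
If it helps to see the logic spelled out, here is a rough plain-Python sketch of the same idea. To be clear, this is not BBMap and not how filterbyname.sh is implemented; it is only an illustration of the two things asked for: report which files contain any of the listed IDs, and write copies of the files without those reads. It assumes uncompressed, strict 4-line-per-record FASTQ, files ending in .fastq, and an ID list with one read ID per line. The function names and paths are made up for the example.

```python
from pathlib import Path


def read_ids(path):
    """Read one read ID per line, dropping blank lines and any leading '@'."""
    with open(path) as handle:
        return {line.strip().lstrip("@") for line in handle if line.strip()}


def fastq_records(path):
    """Yield (read_id, lines) for each record of a strict 4-line FASTQ file."""
    with open(path) as handle:
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0].strip():
                return  # end of file (or a trailing blank line)
            # The ID is the header up to the first whitespace, without '@'.
            fields = record[0][1:].split()
            yield (fields[0] if fields else ""), record


def filter_fastq(in_path, out_path, ids):
    """Copy in_path to out_path without the listed reads; return removed IDs."""
    removed = set()
    with open(out_path, "w") as out:
        for read_id, record in fastq_records(in_path):
            if read_id in ids:
                removed.add(read_id)
            else:
                out.writelines(record)
    return removed


def filter_directory(in_dir, out_dir, ids_file):
    """Filter every *.fastq in in_dir; return {file name: removed IDs} for hits."""
    ids = read_ids(ids_file)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    hits = {}
    for fastq in sorted(Path(in_dir).glob("*.fastq")):
        removed = filter_fastq(fastq, out_dir / fastq.name, ids)
        if removed:
            hits[fastq.name] = removed
    return hits
```

Calling filter_directory("fastq_in", "fastq_filtered", "ids.txt") (hypothetical paths) would return a dictionary of the files that contained at least one listed ID, and leave the filtered copies in fastq_filtered without touching the originals. If the FASTQ headers carry /1 or /2 suffixes that the BAM read names lack, the IDs would need to be normalized first.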