6.9 years ago
ste.lu
Hi All,
If I'm not mistaken, the subseq command of seqtk lets me fish out reads that contain a specific sequence. Is that right?
The command is:
seqtk subseq in.fastq list.txt > out.fastq
In this case the structure of the file list.txt is not clear to me: is it enough to list the sequences I want to look for? I didn't find anything about this file in the seqtk documentation.
Thank you
One read ID per line (without the leading @).
Example fastq:
Example list:
Output:
Thanks for the answer!
Then this is not what I was looking for. Is there a way to search FASTQ files for a particular sequence and fish out all the reads that contain it?
There are two ways to fish out reads of interest from a FASTQ file: 1) by read ID, 2) by sequence or part of a sequence.
The command posted in the OP fishes out reads by name. If you want to fish out reads by sequence, please furnish example input and expected output here.
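To make the by-ID case concrete, here is a minimal Python sketch of that kind of filter. It is not seqtk's code: the function names and the toy reads are made up for illustration, and it assumes plain 4-line FASTQ records where the read ID is the header text after @ up to the first whitespace.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input
            break
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual

def subset_by_id(fastq_handle, wanted_ids):
    """Keep records whose ID (header without '@', up to first whitespace) is in wanted_ids."""
    kept = []
    for header, seq, plus, qual in read_fastq(fastq_handle):
        read_id = header[1:].split()[0]  # assumption: ID ends at first whitespace
        if read_id in wanted_ids:
            kept.append((header, seq, plus, qual))
    return kept

# Toy data, invented for this sketch (not the reads from the thread).
demo_fastq = io.StringIO(
    "@read1 extra\nACGT\n+\nIIII\n"
    "@read2\nTTTT\n+\nIIII\n"
)
demo_ids = {"read2"}  # same idea as list.txt: one ID per line, no leading '@'
demo_kept = subset_by_id(demo_fastq, demo_ids)
```

With real files, the set of IDs would be read from list.txt line by line, stripping the trailing newline.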
Thanks for the explanation cpad0112.
Let's say I have this fastq file:
and I want to have in another file (i.e. subset.fq) only the reads that contain: 'GTTCATAGCTGTTTC'
To give a practical example: I have a ChIP-seq experiment and I want to put only the reads with a specific motif in another file.
Thank you
I am not sure whether seqtk can subset by string (pattern). Instead, try the following. Download and install seqkit from here: https://github.com/shenwei356/seqkit#installation
Input (since the FASTQ sequences provided above consist of 2 full FASTQ records plus half a record without quality information, I trimmed them to 2 FASTQ records):
output:
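If it helps to see the logic spelled out, the same kind of by-sequence filter can be sketched in plain Python. This is only an illustration and not how seqkit is implemented: filter_by_motif and the toy reads are hypothetical, it assumes 4-line FASTQ records, and it searches the forward strand only.

```python
import io

def filter_by_motif(in_handle, out_handle, motif):
    """Copy 4-line FASTQ records whose sequence contains motif (case-insensitive).

    Returns the number of records written.
    """
    motif = motif.upper()
    written = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of input
            break
        if motif in record[1].upper():
            out_handle.writelines(record)
            written += 1
    return written

# Toy reads invented for this sketch; only the first one carries the motif.
demo_in = io.StringIO(
    "@r1\nAAGTTCATAGCTGTTTCAA\n+\nIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGT\n+\nIIIIIIII\n"
)
demo_out = io.StringIO()
n_written = filter_by_motif(demo_in, demo_out, "GTTCATAGCTGTTTC")
```

For real data, the two StringIO objects would be replaced by open("in.fastq") and open("subset.fq", "w").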
Thanks cpad, it's working!! Actually I want them in another file so I am using:
Do you agree?
Thanks a lot again!
Correct. But consider the following suggestions:
In the option string -sdip, d (-d) enables IUPAC degenerate mode, whereas r (-r) means regex. With -d, degenerate bases in the query are matched against the corresponding bases in the target sequence. However, this doesn't work in reverse.
For example, if the query sequence has the degenerate base "S", the program will match it with the DNA bases "G" or "C" (as per the IUPAC DNA code) in the target sequence. The motif GTTCATAGCTGTTTS matches both GTTCATAGCTGTTTG and GTTCATAGCTGTTTC in the target. In your case, the motif doesn't contain any degenerate bases.
The following is an example: the motif is GTTTTCSCC and the target is GTTTTCGCC.
No results
Result with the same reference FASTQ, because S in the query matches G or C:
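The difference between literal and degenerate matching can be reproduced with a few lines of Python. This is a sketch of the idea only, not seqkit's implementation: motif_to_regex is a hypothetical helper that expands each IUPAC code in the query into a regex character class.

```python
import re

# IUPAC DNA codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Compile a motif with IUPAC degenerate bases into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

target = "GTTTTCGCC"

# Literal search: "S" is treated as an ordinary character, so nothing is found.
literal_hit = "GTTTTCSCC" in target  # False

# Degenerate search: S in the query expands to [CG] and matches the G in the target.
degenerate_hit = motif_to_regex("GTTTTCSCC").search(target) is not None  # True

# Not symmetric: a degenerate base in the target is not expanded.
reverse_hit = motif_to_regex("GTTTTCGCC").search("GTTTTCSCC") is not None  # False
```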
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
Thanks Ram for reformatting the question!