Grep specific sequence
4
0
Entering edit mode
8 days ago

Hi there, Is there any tools to find specific sequence (not as part of a longer sequence) in a fastq file?

So for example if I am looking for AACTACGTCCATCG and only wants "hits" with exact matches AACTACGTCCATCG even thought its within a longer sequence i.e GCACTACTGGTCA(AACTACGTCCATC)? I know that using the function grep will give you both AACTACGTCCATC and GCACTACTGGTCA(AACTACGTCCATC).

Hope it makes sense.

Cheers

fastq smallRNA_Seq smallRNA GREP • 492 views
1
Entering edit mode

"not as part of a longer sequence" is exactly the opposite of "even though its withing a longer sequence". Maybe you should clarify that, although it looks like a simple grep would do.

1
Entering edit mode
8 days ago

If it is a "well-behaved" (no blank lines, no multi-line seqs, no such sequence in the quality string) fastq file the following also works very nicely to give you all matching fastq records, so no need to install any toolkit:

grep -B1 -A2 -e "^AACTACGTCCATC$" test.fq  ADD COMMENT 1 Entering edit mode And if it is a "very-well-behaved" one (no such sequence in any header, then a fixed word search would be even slightly faster ;) grep -B1 -A2 -F -w AACTACGTCCATC test.fq  Edit: just tested on a very large fastq file and the -w option is not at all faster than a proper start-end search. grep -B1 -A2 "^AACTACGTCCATC$" test.fq is definitely faster.

0
Entering edit mode

Thanks a lot!!

0
Entering edit mode

marie.lorans : Based on clarification below this is the result you wanted (though others are useful in other circumstances). So you should accept Michael Dondrup answer (green checkmark) to provide closure to this thread.

0
Entering edit mode

I think in this context "very-well-behaved" == "well-behaved". I thought about the headers as well, but fastq headers start with @, so that should be ok, but it could theoretically appear in the quality string.

0
Entering edit mode

Why the -e?

0
Entering edit mode

Looks more like a habit, maybe to search for multiple patterns, rather than a requirement in this case.

0
Entering edit mode

Indeed, I always write it like this, think some linux guru taught me that this was better.

1
Entering edit mode
8 days ago

Seqkit grep should do that: See https://bioinf.shenwei.me/seqkit/usage/#grep The command might look like this (tested, works ok, insert your sequence of choice between ^ and $): seqkit grep -s -r -i -P -p "^aggcg$"  file.fq

1
Entering edit mode
8 days ago
GenoMax 99k

bbduk.sh from BBMap suite in filter mode would work as well. A GUIDE is available.

\$ bbduk.sh -Xmx2g in=file1.fq literal=your_seq_of_interest outm=out.fq


Use in1= in2= outm1= outm2= if you have paired-end data.

0
Entering edit mode
8 days ago
sg927357 ▴ 20

Use grep -o flag it will give you the exact sequence. Usage: grep -o AACTACGTCCATCG filename

1
Entering edit mode

This is wrong. -o does not limit matching, it is a display related flag.

grep -o AACTACGTCCATCG filename


will print

AACTACGTCCATCG


as many times as the file has that string in it, it won't match in an "exact" manner as grep -w would.

0
Entering edit mode

Yeah I know, but accordingly to the question asked, "its within a longer sequence i.e GCACTACTGGTCA(AACTACGTCCATC)", if the string matches within a sequence, it will display only the match portion.

0
Entering edit mode

OP's wording is quite ambiguous - it's not clear if they only want exact matches or if they want all matches with exact overlaps highlighted. The latter doesn't make much sense, but I guess if that were their question, your answer would be the best fit. Let's see if they clarify on their requirement.

0
Entering edit mode

Thanks a lot. We want exact matches i.e AACTACGTCCATC but not if it's a part of a longer sequence