Getting coordinates for a given sequence motif
2
0
Entering edit mode
5 weeks ago
elcortegano ▴ 110

I would like to get all coordinates for a given series of short sequence motifs, e.g. GATTACA (and its reverse complementary TGTAATC) from a reference genome, but want to do it in a way that allows me to get the coordinates of the sampled motif (because I need to intersect these coordinates with other annotation).

My current approach is based on a strict blastn-short search of that motif. However, this is extremely inefficient, because the minimum length for the algorithm to work requires sequences of a minimum length of 15 bp, so I need to append all possible combinations of nucleotides to the flanks of the motif to reach that minimum length. As consequence, the blast search is not just on a couple or a few motifs, but on thousands of them. For shorter targets, more than inefficient is is plain impractical.

There must be a way to do this more straightforwardly, but have doubts for longer ones, and I am wondering if anybody here has a better idea on how to address this problem.

Thank you,

EDITED: what I want really is not sampling, but getting all genomic coordinates for these motifs.

genome assembly • 252 views
ADD COMMENT
4
Entering edit mode
5 weeks ago

In addition to the options presented in ATpoint's link, seqkit locate will work also. It's become my go-to software for many common tasks for fasta and fastq files.

ADD COMMENT
2
Entering edit mode

Yeah, this is probably the best solution as it is an efficient command line tool and the solutions in the thread I linked are rather custom code. I always forget about seqkit but it is a super handy and efficient tool, swiss army knife for fasta/q files.

From the seqkit link:

# max mismatch: 1
$ cat t.fa \
  | seqkit locate -p agc -m 1 \
  | csvtk pretty -t
seqID   patternName     pattern strand  start   end    matched
seq     agc           agc       +        1       3     agc
seq     agc           agc       +        7       9     agc
seq     agc           agc       +        11      13    acc
seq     agc           agc       -        8       10    agc
seq     agc           agc       -        2       4     agc
ADD REPLY
1
Entering edit mode
ADD COMMENT
0
Entering edit mode

This looks promising, thanks!

ADD REPLY

Login before adding your answer.

Traffic: 1571 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6