Question: How to find a protein motif in a DNA sequence
Benn (7.9k) wrote:

I have a protein motif (site) that I would like to identify in DNA sequences (a multi-FASTA file). The motif is N-X-S/T (X != P), which means Asn, followed by any amino acid except Pro, followed by Ser or Thr. X should also not be a STOP codon. So I would like to find all three-codon combinations that encode this site in DNA (9 nucleotides).

I first thought of writing the motif in DNA using IUPAC codes, but that does not seem possible. Writing out all the possibilities by hand seems too laborious, so I thought there might be a tool that can do this. Any suggestions?

written 16 months ago by Benn (7.9k)

Doesn't BLAST(P) already support certain redundant characters?

I'm not sure you'll be able to define all of those exactly, since typically X means any amino acid (I think), without any restriction. You may not be able to find an alphabet that supports all of what you need.

You could maybe blast: NXS and NXT, and then filter the results with a regex to make sure that the next codon is != *

modified 16 months ago • written 16 months ago by Joe (15k)
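
The protein-level filter suggested above can be sketched with a single regular expression, assuming the DNA has already been translated (for example with a tool such as EMBOSS transeq) and that stops are written as `*`. The sequence below is made-up toy data and `find_sequons` is a hypothetical helper name.

```shell
# N, then any residue except Pro or stop (*), then Ser or Thr.
# Inside a bracket expression the * is a literal character.
find_sequons() {
  grep -oE 'N[^P*][ST]'
}

# Toy protein: NAS and NGT qualify; NPT (Pro) and N*S (stop) do not.
printf '%s\n' 'MNASKNPTQN*SNGT' | find_sequons
# prints:
# NAS
# NGT
```

This only works in the reading frame that was translated, so each frame of interest would need to be translated and searched separately.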
cschu181 (1.9k) wrote:

Haven't tried it, but you could do a 2-tiered grep-approach. Make sure your fasta is not line-wrapped.

grep -o "AA[CT][ACGT]\{3\}\([AU]C[ACGT]|AG[CU]\)" fasta_file | grep -v "[ACGT]\{3\}CC[ACGT][ACGT]\{3\}"

Assuming Asn = AAY = AA[CU]; Ser = UCN or AGY = UC[ACGT] or AG[CU]; Thr = ACN = AC[ACGT]; and Pro = CCN = CC[ACGT] (for DNA, read T for U), the first part should match all peptides N-X-S/T with X in the traditional sense (any amino acid). Note that [AU]C[ACGT] covers both UCN (Ser) and ACN (Thr). The second part should get rid of the ones that contain proline in the central position. I am not sure whether you have to use an additional set of \(\) in the first expression.

written 16 months ago by cschu181 (1.9k)
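
The advice about line wrapping matters because grep works line by line, so a 9-nucleotide site that spans a line break would be missed. A minimal sketch of unwrapping a FASTA file with awk follows; `unwrap_fasta` is a hypothetical helper name and the records are toy data.

```shell
# Join all sequence lines of each record into one line; header lines are kept as-is.
unwrap_fasta() {
  awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
       { seq = seq $0 }
       END { if (seq != "") print seq }'
}

# Toy wrapped input: seq1 is split over two lines.
printf '>seq1\nAAATGC\nATCTGG\n>seq2\nCCCAAC\n' | unwrap_fasta
# prints:
# >seq1
# AAATGCATCTGG
# >seq2
# CCCAAC
```

For a real file this could be run as `unwrap_fasta < fasta_file > unwrapped.fa` before any grep step.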

Sounds like a good solution. I have tried it, and the extra "\" is indeed necessary. Can you explain what the "\" does here with grep? (I am learning, thanks!)

grep -o "aa[ct][acgt]\{3\}\([at]c[acgt]\|ag[ct]\)" fasta.fa | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}"
modified 16 months ago • written 16 months ago by Benn (7.9k)
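
As a sketch only, the same two-tier idea can be written with `grep -E` and extended to drop stop codons in the middle position, which the original question asked for but the commands above do not handle. `find_sites` is a hypothetical helper name, the sequences are toy data, and lowercase, unwrapped input is assumed.

```shell
# Tier 1: Asn codon, any codon, then a Ser/Thr codon.
# Tier 2: drop 9-mers whose middle codon is Pro (ccn) or a stop (taa, tag, tga).
# Each line coming out of grep -o is exactly one 9-mer, so ^ anchors at codon 1.
find_sites() {
  grep -oE 'aa[ct][acgt]{3}([at]c[acgt]|ag[ct])' |
    grep -vE '^[acgt]{3}(cc[acgt]|taa|tag|tga)'
}

# Toy input: Ala, Pro, stop and Lys in the middle position, respectively.
printf '%s\n' 'ggaatgcatctgg' 'aacccgtcc' 'aattaaagc' 'ttaacaaaagtcc' | find_sites
# prints:
# aatgcatct
# aacaaaagt
```

Two caveats apply to this sketch as well as to the commands above: the search is not restricted to a reading frame, and `grep -o` reports only non-overlapping matches, so a site that overlaps a discarded match can be missed.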

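As far as I understand, the backslashes here are meant for grep rather than for the shell: inside quotes the shell passes them through unchanged, and in grep's default (basic) regular expression syntax it is \{ \}, \( \) and \| that give braces, parentheses and the pipe their special meaning (repetition, grouping, alternation). I never really understand which characters need to be escaped and which don't... This post gives an overview, but some of it ("may need to be quoted under certain circumstances.") just feels as if one has freshly escaped from an asylum...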

written 16 months ago by cschu181 (1.9k)

Which ones do and don't need to be escaped depends on your shell, on whether you're using extended regular expressions (grep -E vs. plain grep), and on some other factors such as whether you're using quotes or not.

modified 16 months ago • written 16 months ago by Joe (15k)
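
To make the basic vs. extended distinction concrete, here is a small sketch that runs the same pattern both ways on toy data. It assumes GNU grep, since `\|` for alternation in basic syntax is a GNU extension rather than POSIX; the variable names are made up for the example.

```shell
dna='aatgcatct aacgggagc'

# Basic regular expressions (plain grep): intervals, groups and alternation
# are written with backslashes. Single quotes keep the shell from touching them.
bre_hits=$(printf '%s\n' "$dna" | grep -o 'aa[ct][acgt]\{3\}\([at]c[acgt]\|ag[ct]\)')

# Extended regular expressions (grep -E): the same operators are written bare.
ere_hits=$(printf '%s\n' "$dna" | grep -oE 'aa[ct][acgt]{3}([at]c[acgt]|ag[ct])')

printf 'BRE:\n%s\nERE:\n%s\n' "$bre_hits" "$ere_hits"
```

Both variables should hold the same two 9-mers, `aatgcatct` and `aacgggagc`. Sticking to `grep -E` with single-quoted patterns avoids most of the escaping confusion.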

Yes, but this is all a big, big mess that way...

written 16 months ago by cschu181 (1.9k)

No worries about the backslashes @cschu181, every user (OP) is responsible for double-checking whether the code given here as an answer really does the trick (or needs a little tweaking). In this case I could use your approach (and I liked it, especially the grep -v part). I actually modified it a bit, but the idea was certainly yours. I ended up using fuzznuc (from EMBOSS) with your pattern suggestion, and then grep -v to get rid of the proline patterns.

fuzznuc -pattern 'AA[CT]NNN[AT]CN' -sequence fasta.fa -outfile prog_pattern_1.txt

fuzznuc -pattern 'AA[CT]NNNAG[TC]' -sequence fasta.fa -outfile prog_pattern_2.txt

cat prog_pattern_1.txt prog_pattern_2.txt | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" > prog_pattern_no_Pro.txt

So thanks for the help!

written 16 months ago by Benn (7.9k)

Glad to help. Nice modification with fuzznuc (much more concise than regexing the whole thing.)

written 16 months ago by cschu181 (1.9k)