Question: How to find protein motif in DNA sequence
Benn (8.0k) wrote:

I have a protein motif (site) that I would like to identify in a DNA sequence (multi-FASTA file). The motif is N-X-S/T (X != P), which means Asn, followed by any amino acid except Pro, followed by Ser or Thr. X should also not be a STOP codon. So I would like to find all the 3-codon combinations for this site in DNA (9 nucleotides).

I first thought of writing the motif in DNA using IUPAC codes, but that does not seem possible. Writing out all the possibilities by hand seems too laborious, so I thought there might be a tool that can do this. Any suggestions?

motif sequence • 894 views
written 2.2 years ago by Benn (8.0k)

Doesn't BLAST(P) already support certain redundant characters?

I'm not sure you'll be able to define all of those exactly, since typically X means any amino acid (I think), without any restriction. You may not be able to find an alphabet that supports all of what you need.

You could maybe blast: NXS and NXT, and then filter the results with a regex to make sure that the next codon is != *

modified 2.2 years ago • written 2.2 years ago by Joe (18k)

cschu181 (2.5k) wrote:

Haven't tried it, but you could do a 2-tiered grep-approach. Make sure your fasta is not line-wrapped.

grep -o "AA[CT][ACGT]\{3\}\([AU]C[ACGT]|AG[CU]\)" fasta_file | grep -v "[ACGT]\{3\}CC[ACGT][ACGT]\{3\}"

Assuming Asn = AAY = AA[CU]; Ser = UCN or AGY = UC[ACGT] or AG[CU]; Thr = ACN = AC[ACGT]; and Pro = CCN = CC[ACGT], the first part should match all peptides N-X-S/T (with X in the traditional sense of any amino acid), and the second should get rid of the ones that contain proline in the central position. I am not sure whether you have to use an additional set of \(\) in the first expression.
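Because this approach relies on each sequence sitting on a single line, a wrapped FASTA file needs to be unwrapped first. A minimal sketch using awk follows; the function name unwrap_fasta is made up for illustration and is not an existing tool.

```shell
# Hypothetical helper: join wrapped sequence lines so that every record
# becomes one header line followed by exactly one sequence line.
# Reads the files given as arguments, or standard input if none are given.
unwrap_fasta() {
    awk '/^>/ { if (seq != "") print seq; print; seq = ""; next }
         { seq = seq $0 }
         END { if (seq != "") print seq }' "$@"
}
```

It could be used as `unwrap_fasta fasta_file > unwrapped.fa` (the output file name is an assumption) before running the grep pipeline on the result.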

written 2.2 years ago by cschu181 (2.5k)

Sounds like a good solution. I have tried it, and indeed the extra "\" is necessary. Can you explain what the "\" does here with grep? (I am learning, thanks!)

grep -o "aa[ct][acgt]\{3\}\([at]c[acgt]\|ag[ct]\)" fasta.fa | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}"
modified 2.2 years ago • written 2.2 years ago by Benn (8.0k)

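In grep's default basic regular expressions, a backslash in front of (, ), {, } or | switches on their special meaning (grouping, repetition, alternation; the last one is a GNU extension). Without the backslash they are matched as literal characters. Inside double quotes the shell passes these backslashes through to grep unchanged. I never really understand which characters need to be escaped and which don't... This post gives an overview, but some of it ("may need to be quoted under certain circumstances.") just feels as if one has freshly escaped from an asylum...

written 2.2 years ago by cschu181 (2.5k)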

Which ones do and don't need to be escaped depends on your shell, on whether you're using extended regular expressions (grep -E vs. plain grep), and on some other factors, such as whether you're using quotes or not.
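To make the difference concrete, here is a small sketch that runs the same pattern in both syntaxes on a made-up toy sequence. It assumes GNU grep, since the -o option and \| in basic regular expressions are extensions beyond POSIX.

```shell
# Toy input, invented for illustration: it contains one N-G-S site
# (aac ggt tca) and one N-P-T site (aac cca act).
seq='ggaacggttcaggaacccaactgg'

# Basic regular expressions (default): grouping, repetition and
# alternation must be written as \( \), \{ \} and \|.
bre=$(printf '%s\n' "$seq" | grep -o 'aa[ct][acgt]\{3\}\([at]c[acgt]\|ag[ct]\)')

# Extended regular expressions (-E): the same operators without backslashes.
ere=$(printf '%s\n' "$seq" | grep -oE 'aa[ct][acgt]{3}([at]c[acgt]|ag[ct])')

# Both variants should report the same two 9-mers.
[ "$bre" = "$ere" ] && echo "same matches"
```

Single quotes are used here so that the shell leaves the pattern alone entirely and only grep interprets it.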

modified 2.2 years ago • written 2.2 years ago by Joe (18k)

Yes, but this is all a big, big mess that way...

written 2.2 years ago by cschu181 (2.5k)

No worries about the backslashes @cschu181; every user (OP) is responsible for double-checking whether the code given in an answer really does the trick (or needs a little tweaking). In this case I could use your approach (and I liked it, especially the grep -v part). I modified it a bit, but the idea was certainly yours. I ended up using fuzznuc (from EMBOSS) with your pattern suggestion, followed by grep -v to get rid of the proline patterns.

fuzznuc -pattern 'AA[CT]NNN[AT]CN' -sequence fasta.fa -outfile prog_pattern_1.txt

fuzznuc -pattern 'AA[CT]NNNAG[TC]' -sequence fasta.fa -outfile prog_pattern_2.txt

cat prog_pattern_1.txt prog_pattern_2.txt | grep -v "[acgt]\{3\}cc[acgt][acgt]\{3\}" > prog_pattern_no_Pro.txt
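The commands above remove proline in the middle position, but the original requirement that X must not be a STOP codon is not covered yet. A possible extension, sketched under the assumption that the input is one bare 9-mer per line (for example the output of a grep -o step, not a full fuzznuc report), is to filter the stop codons TAA, TAG and TGA in the same pass. The function name drop_pro_and_stop is hypothetical.

```shell
# Hypothetical helper: drop 9-mers whose middle codon is proline (CCN)
# or a stop codon (TAA, TAG, TGA). Expects one 9-mer per line on
# standard input; -i makes the match case-insensitive.
drop_pro_and_stop() {
    grep -viE '^[acgt]{3}(cc[acgt]|taa|tag|tga)[acgt]{3}$'
}

# Toy 9-mers, invented for illustration: N-G-S, N-P-T and N-STOP-S.
printf '%s\n' aacggttca aacccaact aactaaagt | drop_pro_and_stop
```

Of the three toy 9-mers, only aacggttca passes the filter.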

So thanks for the help!

written 2.2 years ago by Benn (8.0k)

Glad to help. Nice modification with fuzznuc (much more concise than regexing the whole thing).

written 2.2 years ago by cschu181 (2.5k)