Question: (Closed) grep only specific motif from whole protein sequence
9 months ago, Jason wrote:

Hello All,

How can I grep only a specific motif out of complete sequences in a FASTA file using a shell command? I also want to print the header line (the line beginning with >) before each extracted motif. A previous post (A: grep whole sequences containing a specific motif in a fasta file) helped me grep whole sequences that contain the motif, but now I want only the motifs, each with its protein ID as a header. Some protein sequences contain more than one motif.

My motifs look like this: SXXXX(F/S)XXXL

Here is a list of protein sequences:

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLSIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMSEVMWFLQKLVQDLTALLSSK
>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESYSLTQLLMTEVMKLQKKVQDLTALLSSK
>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

The result should look like this:

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing
SVLEGSRVTL
>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing
SEVMWFLQKL
>sp|Q9H37.2|CTYU_HUMAN 
SVLEGFRVTL
>sp|Q9H37.2|CTYU_HUMAN 
SGESYSLTQL

This command returns the whole sequence that contains the motif, which is not what I want:

grep -E 'S[A-Z]{4}[FS][A-Z]{3}L' jara3.fasta > jara4.fasta

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in the toolbar when you compose or edit a post.

written 9 months ago by WouterDeCoster

Hello Jason!

We believe that this post does not fit the main topic of this site.

This is a pure Unix question: man grep is all it takes to find the answer, which involves grep -o. Or better, use bioawk.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

modified 9 months ago • written 9 months ago by RamRS

The answer is among the first hits if you google "grep only print matching pattern".

written 9 months ago by WouterDeCoster
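
The grep -o route mentioned in the replies above can be sketched as follows. This is a minimal sketch, not a full answer to the question: it prints only the matched motifs and drops the header lines, and it reports non-overlapping hits only. The function name motifs_only is hypothetical; the regular expression is the one from the question.

```shell
# Hypothetical helper: print every motif hit found in the sequence lines of a FASTA file.
motifs_only() {
    # Drop header lines first so that a header can never produce a false hit,
    # then let grep print only the matching part (-o) of each sequence line.
    grep -v '^>' "$1" | grep -oE 'S[A-Z]{4}[FS][A-Z]{3}L'
}
```

With the file names from the question this would be called as motifs_only jara3.fasta > jara4.fasta. Because the headers are lost, this variant only fits cases where the protein ID is not needed next to each motif.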
9 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:
 cat input.fa | paste - - | awk -F '\t' '{N=match($2,/(S....[FS]...L)/,a);if(N==0) next;printf("%s\n%s\n",$1,a[1]);}'

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
SVLEGSRVTL
>sp|Q9H37.2|CTYU_HUMAN 
SVLEGFRVTL

(assuming two lines per FASTA record; otherwise, use https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345 )

written 9 months ago by Pierre Lindenbaum
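
The one-liner above reports only the first hit per sequence and expects exactly two lines per record, whereas the question asks for every motif, each under its own header. A minimal sketch of one way to do that is shown here; it is not Pierre's method. It joins wrapped sequence lines, loops over the hits with the two-argument match() and the RSTART/RLENGTH variables (so it should not depend on GNU awk), and reports non-overlapping hits only. The function name fasta_motifs is hypothetical.

```shell
# Hypothetical helper: print the header followed by each motif hit, one pair per hit.
fasta_motifs() {
    awk '
    # Report all non-overlapping hits in the current record (globals: header, seq).
    function report(   rest) {
        rest = seq
        # match() sets RSTART and RLENGTH; keep searching in the text after each hit
        while (match(rest, /S....[FS]...L/)) {
            print header
            print substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
        }
    }
    /^>/ { if (header != "") report(); header = $0; seq = ""; next }
         { seq = seq $0 }   # join sequence lines that were wrapped
    END  { if (header != "") report() }
    ' "$1"
}
```

It would be called as fasta_motifs input.fa. Records without a hit print nothing, which matches the expected output in the question.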

Hey, thanks for the help, but I'm getting an error. How can I resolve it?

awk: syntax error at source line 1
 context is
	>>> {N=match($2,/(S....[FS]...L)/, <<< 
awk: illegal statement at source line 1
awk: illegal statement at source line 1
modified 9 months ago by RamRS • written 9 months ago by Jason

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

written 9 months ago by RamRS
$ awk --version
GNU Awk 4.1.3, API: 1.1 (GNU MPFR 3.1.4, GNU MP 6.1.0)
Copyright (C) 1989, 1991-2015 Free Software Foundation.
written 9 months ago by Pierre Lindenbaum
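
The syntax error reported above is most likely caused by the three-argument form of match(), which is a GNU awk extension; the version output suggests the one-liner was written for GNU Awk, while the error format looks like that of a non-GNU awk. A minimal sketch of the same one-liner using only the two-argument match() with RSTART and RLENGTH follows. It keeps the original assumptions (two lines per record, first hit only), and the function name first_motif is hypothetical.

```shell
# Hypothetical helper: first motif per record, two-line FASTA records assumed.
first_motif() {
    # paste - - joins each header line and its sequence line with a tab
    paste - - < "$1" |
    awk -F '\t' '
        # two-argument match() returns 0 when nothing is found,
        # otherwise it sets RSTART and RLENGTH for substr()
        match($2, /S....[FS]...L/) {
            printf("%s\n%s\n", $1, substr($2, RSTART, RLENGTH))
        }'
}
```

It would be called as first_motif input.fa. Installing gawk and running the original command with gawk in place of awk should also work, assuming gawk is available for the system in use.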
The thread is closed. No new answers may be added.
