Question: grep whole sequences containing a specific motif in a fasta file
0
9 months ago by
Jason0
Jason0 wrote:

How can I grep the complete sequences containing a specific motif in a FASTA file using a shell command? I also want to include the header lines beginning with > that precede these target sequences.

I found this post about grepping the complete sequences containing a specific motif in a FASTA file. It is similar to my problem, but I'm looking for a different motif.

My motif looks like this:

SXXXX(F/S)XXXL

Each sequence in my FASTA file is on a single line, and I have more than 300 sequences.

For example, my sequences:

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

The result should be only the first two sequences, because they contain the motif SXXXX(F/S)XXXL:

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

I tried this command, but it returned all three sequences:

grep 'S...F\|S\|L.\(.\)\1\{4\}' jara3.fasta -B 1 > jara4.fasta
sequencing • 481 views
modified 8 months ago by Malcolm.Cook1.0k • written 9 months ago by Jason0

What if I want to keep only the sequences that don't have the motif "SXXXX(F/S)XXXL" and save them in a new FASTA file?

The result in that case should be only this sequence:

>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

written 9 months ago by Jason0

Read man grep and particularly read about the -v option, and think about how you would use that with the accepted answer.
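
A caveat on -v for this file layout: header lines never contain the motif, so grep -v on its own would keep every header. The following is a minimal sketch that swaps grep for awk and treats each header and its sequence as a pair. It assumes exactly one header line followed by one sequence line per record; the file names (demo.fasta, no_motif.fasta) and the shortened toy records are made up for illustration.

```shell
# Toy input (hypothetical IDs, shortened sequences): header line, then one sequence line.
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>hit_a
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYL
>miss_b
BKLSVLEGWRVTLTSVIDRFRITPYL
>hit_c
HHHSVLEGFRVTLTSVIDRFRITPYL
EOF

# Remember each header; on the sequence line that follows, print the pair
# only when the motif S....[FS]...L is absent (each dot stands for one X).
awk '/^>/ { header = $0; next }
     $0 !~ /S....[FS]...L/ { print header; print $0 }' \
    "$workdir/demo.fasta" > "$workdir/no_motif.fasta"

cat "$workdir/no_motif.fasta"
```

Dropping the ! in !~ gives the opposite selection, that is, only the records that do contain the motif.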

modified 9 months ago • written 9 months ago by Alex Reynolds28k
4
9 months ago by
Ark70
US
Ark70 wrote:

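I believe the problem is the \| in your pattern rather than the \{4\}: in grep's basic regular expressions \| means "or", so your pattern matches any line that contains an S, which is why all three sequences come back. The motif is easier to write as an extended regular expression, where {4} needs no backslashes, and for that you would need to use either egrep or the -E flag with grep.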

I got it to work using this:

grep -E 'S[A-Z]{4}[FS][A-Z]{3}L' jara3.fasta > jara4.fasta

Hope that works!
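
Note that the command above writes only the matching sequence lines. Below is a minimal sketch for keeping the > header of each hit as well, using the -B 1 option from the original attempt. It assumes each record is exactly one header line followed by one sequence line; the file names (demo.fasta, hits.fasta) and the shortened toy records are placeholders, not the real data.

```shell
# Toy input (hypothetical IDs, shortened sequences): header line, then one sequence line.
workdir=$(mktemp -d)
cat > "$workdir/demo.fasta" <<'EOF'
>hit_a
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYL
>miss_b
BKLSVLEGWRVTLTSVIDRFRITPYL
>hit_c
HHHSVLEGFRVTLTSVIDRFRITPYL
EOF

# -E: extended regex; -B 1: also print the line before each match (the header).
# grep prints a "--" line between non-adjacent hits, so filter those out afterwards.
grep -E -B 1 'S[A-Z]{4}[FS][A-Z]{3}L' "$workdir/demo.fasta" \
    | grep -v '^--$' > "$workdir/hits.fasta"

cat "$workdir/hits.fasta"
```

With GNU grep, the second grep can be replaced by adding --no-group-separator to the first one.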

modified 9 months ago • written 9 months ago by Ark70

It worked.

Thank you so much.

written 9 months ago by Jason0
4
9 months ago by
cpad011211k
India
cpad011211k wrote:

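Using seqkit, for sequences matching the regex (input from the OP). Here -s searches the sequence instead of the ID, -r treats the pattern as a regular expression, -i ignores case, and -p gives the pattern: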

$ seqkit grep -srip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9 
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRK
VGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLM
TEVMKLQKKVQDLTALLSSK
>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

For sequences that do not match the regex (-v inverts the match):

 $ seqkit grep -svrip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9re7.2|CARer_HUMAN RecName 
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

Note: I used . in the pattern, assuming that only letters (amino acid codes) are present in the sequences.

written 9 months ago by cpad011211k

It is working, thank you.

written 9 months ago by Jason0
1
8 months ago by
Malcolm.Cook1.0k
kansas, usa
Malcolm.Cook1.0k wrote:

If you happen to have MEME-Suite installed, you also have a very nice fasta-grep, which:

  • knows about IUPAC codes
  • ignores newlines and treats sequence
  • can emit positions or matches - your pick
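
For readers without MEME-Suite, the "ignores newlines" behaviour can be approximated in plain awk by joining the wrapped lines of each record before matching. This is only a sketch of the idea, not how fasta-grep itself works; the file names (wrapped.fasta, hits.fasta), the helper name emit, and the toy records are assumptions made for illustration.

```shell
# Toy input (hypothetical IDs): sequences wrapped over two lines, so the
# motif SVLEGSRVTL in hit_a is split across a line break.
workdir=$(mktemp -d)
cat > "$workdir/wrapped.fasta" <<'EOF'
>hit_a
MSDYENDDECWSVLEG
SRVTLTSVIDRSRITPYL
>miss_b
BKLSVLEGWRVTL
TSVIDRFRITPYL
EOF

# Collect the lines of each record into one string, then test that string.
# emit() prints the stored record only if its joined sequence has the motif.
awk 'function emit() {
         if (header != "" && seq ~ /S....[FS]...L/) { print header; print seq }
     }
     /^>/ { emit(); header = $0; seq = ""; next }
     { seq = seq $0 }
     END { emit() }' "$workdir/wrapped.fasta" > "$workdir/hits.fasta"

cat "$workdir/hits.fasta"
```

A line-by-line grep on wrapped.fasta finds nothing here, because no single line holds the whole motif.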
written 8 months ago by Malcolm.Cook1.0k