Question: grep whole sequences containing a specific motif in a fasta file
0
gravatar for Jason
23 days ago by
Jason0
Jason0 wrote:

How to Grep the complete sequences containing a specific motif in a fasta file using shell command? Also, I want to include the lines beginning with a > before these target sequences.

I found this post :about Grep the complete sequences containing a specific motif in a fasta file similar to my problem but I'm looking for different motif:

My motifs look like that :

SXXXX(F/S)XXXL

All my fasta file in one line and I have more than 300 sequences:

for example:

my sequence :

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

The result should be only the first two sequences because they have the motifs SXXXX(F/S)XXXL

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

I tried these command but that returned all three sequences

grep 'S...F\|S\|L.\(.\)\1\{4\}' jara3.fasta -B 1 > jara4.fasta
sequencing • 202 views
ADD COMMENTlink modified 22 days ago by Malcolm.Cook850 • written 23 days ago by Jason0

What about if I want to keep only sequences that don’t have the motif” SXXXX (F/S)XXXL” and save that in new fasta file:

The result in that case should be only this sequence:

sp|Q9re7.2|CARer_HUMAN RecName BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

ADD REPLYlink written 23 days ago by Jason0

Read man grep and particularly read about the -v option, and think about how you would use that with the accepted answer.

ADD REPLYlink modified 23 days ago • written 23 days ago by Alex Reynolds26k
4
gravatar for Ark
23 days ago by
Ark40
US
Ark40 wrote:

I believe that {4} that you are using is an extended regular expression and you would need to either use egrep or the -E flag with grep.

I got it to work using this:

grep -E 'S[A-Z]{4}[FS][A-Z]{3}L' jara3.fasta > jara4.fasta

Hope that works!

ADD COMMENTlink modified 23 days ago • written 23 days ago by Ark40

It was working.

thank u so much

ADD REPLYlink written 23 days ago by Jason0
4
gravatar for cpad0112
23 days ago by
cpad01129.3k
India
cpad01129.3k wrote:

using seqkit, for sequences with regex (input from OP):

$ seqkit grep -srip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9 
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRK
VGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLM
TEVMKLQKKVQDLTALLSSK
>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

For sequences without regex:

 $ seqkit grep -svrip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9re7.2|CARer_HUMAN RecName 
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

Note: I used . in sequence assuming that only alphabets (AA) are present in sequence.

ADD COMMENTlink written 23 days ago by cpad01129.3k

It is working thank u

ADD REPLYlink written 23 days ago by Jason0
1
gravatar for Malcolm.Cook
22 days ago by
Malcolm.Cook850
kansas, usa
Malcolm.Cook850 wrote:

if you happen to have MEME-Suite installed, you also have a very nice fasta-grep which

  • knows about IUPAC codes
  • ignores newlines and treats sequence
  • can emit positions or matches - your pick
ADD COMMENTlink written 22 days ago by Malcolm.Cook850
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1881 users visited in the last hour