grep whole sequences containing a specific motif in a fasta file
3
1
Entering edit mode
5.6 years ago
Jason ▴ 10

How to Grep the complete sequences containing a specific motif in a fasta file using shell command? Also, I want to include the lines beginning with a > before these target sequences.

I found this post :about Grep the complete sequences containing a specific motif in a fasta file similar to my problem but I'm looking for different motif:

My motifs look like that :

SXXXX(F/S)XXXL

All my fasta file in one line and I have more than 300 sequences:

for example:

my sequence :

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

The result should be only the first two sequences because they have the motifs SXXXX(F/S)XXXL

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

I tried these command but that returned all three sequences

grep 'S...F\|S\|L.\(.\)\1\{4\}' jara3.fasta -B 1 > jara4.fasta
sequencing • 3.6k views
ADD COMMENT
0
Entering edit mode

What about if I want to keep only sequences that don’t have the motif” SXXXX (F/S)XXXL” and save that in new fasta file:

The result in that case should be only this sequence:

sp|Q9re7.2|CARer_HUMAN RecName BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK

ADD REPLY
0
Entering edit mode

Read man grep and particularly read about the -v option, and think about how you would use that with the accepted answer.

ADD REPLY
5
Entering edit mode
5.6 years ago

using seqkit, for sequences with regex (input from OP):

$ seqkit grep -srip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9 
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRK
VGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLM
TEVMKLQKKVQDLTALLSSK
>sp|Q9H37.2|CTYU_HUMAN 
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

For sequences without regex:

 $ seqkit grep -svrip 'S.{4}[FS].{3}L' test.fa 

>sp|Q9re7.2|CARer_HUMAN RecName 
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDIL
QRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQK
KVQDLTALLSSK

Note: I used . in sequence assuming that only alphabets (AA) are present in sequence.

ADD COMMENT
0
Entering edit mode

It is working thank u

ADD REPLY
4
Entering edit mode
5.6 years ago
Ark ▴ 90

I believe that {4} that you are using is an extended regular expression and you would need to either use egrep or the -E flag with grep.

I got it to work using this:

grep -E 'S[A-Z]{4}[FS][A-Z]{3}L' jara3.fasta > jara4.fasta

Hope that works!

ADD COMMENT
0
Entering edit mode

It was working.

thank u so much

ADD REPLY
1
Entering edit mode
5.6 years ago
Malcolm.Cook ★ 1.5k

if you happen to have MEME-Suite installed, you also have a very nice fasta-grep which

  • knows about IUPAC codes
  • ignores newlines and treats sequence
  • can emit positions or matches - your pick
ADD COMMENT

Login before adding your answer.

Traffic: 3209 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6