5.6 years ago
Jason
Hello All,
How can I grep only a specific motif out of the complete sequences in a FASTA file using a shell command? I also want to print the header line (the line beginning with >) before each matched motif. A previous post (A: grep whole sequences containing a specific motif in a fasta file) helped me grep the whole sequences that contain the motif, but now I want to extract only the motifs, each with its protein ID as a header. Some protein sequences contain more than one motif.
My motif looks like this (X = any residue): SXXXX(F/S)XXXL
Here is a list of protein sequences:
>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing protein 9; Short=hCARD9
MSDYENDDECWSVLEGSRVTLTSVIDRSRITPYLRQTKVLNPDDEEQVLSDPNLSIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMSEVMWFLQKLVQDLTALLSSK
>sp|Q9H37.2|CTYU_HUMAN
HHHSVLEGFRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESYSLTQLLMTEVMKLQKKVQDLTALLSSK
>sp|Q9re7.2|CARer_HUMAN RecName
BKLSVLEGWRVTLTSVIDRFRITPYLRQTKVLNPDDEEQVLSDPNLVIRKRKVGVLLDILQRTGHKGYVAFLESLELYYPQLYKKVTGKEPARVFSMIIDASGESGLTQLLMTEVMKLQKKVQDLTALLSSK
The result should be displayed like this:
>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing
SVLEGSRVTL
>sp|Q9H257.2|CARD9_HUMAN RecName: Full=Caspase recruitment domain-containing
SEVMWFLQKL
>sp|Q9H37.2|CTYU_HUMAN
SVLEGFRVTL
>sp|Q9H37.2|CTYU_HUMAN
SGESYSLTQL
This command returns the whole sequence that contains the motif, which is not what I want:
grep -E 'S[A-Z]{4}[FS][A-Z]{3}L' jara3.fasta > jara4.fasta
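One possible way to get the header-plus-motif output described above is an awk loop around match(). This is only a minimal sketch: the function name motif_hits is hypothetical, it assumes every sequence sits on a single line (as in the example), and it steps one residue at a time, so overlapping hits are reported as well.

```shell
# Hypothetical helper: print every motif hit under the header of its sequence.
# Assumption: each sequence occupies exactly one line, as in the example above.
motif_hits() {
  awk '
    /^>/ { header = $0; next }   # remember the most recent header line
    {
      seq = $0
      # S, any 4 residues, F or S, any 3 residues, L.
      # The {4} and {3} intervals are written out so older awks accept the pattern.
      while (match(seq, /S[A-Z][A-Z][A-Z][A-Z][FS][A-Z][A-Z][A-Z]L/)) {
        print header
        print substr(seq, RSTART, RLENGTH)
        # Advance by one residue past the hit start so overlapping hits are found too.
        seq = substr(seq, RSTART + 1)
      }
    }
  ' "$@"
}
```

It could be called as motif_hits jara3.fasta > jara4.fasta. If the sequences are wrapped over several lines, they would need to be joined into one line per record first.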