Hello everyone, I am working with fasta files that contain protein sequences. My goal is to extract the longest protein for each gene from this file, to then use it in a cluster analysis. The header of each sequence look like this:
>FBpp0070000 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0070000; name=Nep3-PA; parent=FBgn0031081,FBtr0070000; dbxref=FlyBase:FBpp0070000,FlyBase_Annotation_IDs:CG9565-PA,GB_protein:AAF45370.2,REFSEQ:NP_523417,GB_protein:AAF45370,UniProt/Swiss-Prot:Q9W5Y0,FlyMine:FBpp0070000,modMine:FBpp0070000; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel;
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN
>FBpp0300206 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0300206; name=Nep3-PB; parent=FBgn0031081,FBtr0307554; dbxref=REFSEQ:NP_001245775,GB_protein:AFH07487,FlyBase:FBpp0300206,FlyBase_Annotation_IDs:CG9565-PB,UniProt/Swiss-Prot:Q9W5Y0; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel;
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN
>FBpp0300207 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0300207; name=Nep3-PC; parent=FBgn0031081,FBtr0307555; dbxref=REFSEQ:NP_001245776,GB_protein:AFH07488,FlyBase:FBpp0300207,FlyBase_Annotation_IDs:CG9565-PC,UniProt/Swiss-Prot:Q9W5Y0; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel;
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN
QLIIRNVLEKPAKSFKSDAERKAKVYYESCLDADEHMEKLGAKPMNDLLLQIGGWNVTKSGYNVANWTMGHTLKILHNKY >
As you can see the header contains a lot of information, a lot of it useless for what I'm looking for. I need to retrieve a fasta file with the protein ID (FBpp), the gene ID or Parent (FBgn) of the longest protein of each gene.
I appreciate any advice you can give me.
regards
Do you only want the single longest entry for a given fasta file?
If so you can achieve that with:
Is this what you are looking for ? johnmarondon11
with datamash and seqkit