How to extract the longest protein per gene from a multifast file.
1
0
Entering edit mode
3.9 years ago

Hello everyone, I am working with fasta files that contain protein sequences. My goal is to extract the longest protein for each gene from this file, to then use it in a cluster analysis. The header of each sequence look like this:

>FBpp0070000 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0070000; name=Nep3-PA; parent=FBgn0031081,FBtr0070000; dbxref=FlyBase:FBpp0070000,FlyBase_Annotation_IDs:CG9565-PA,GB_protein:AAF45370.2,REFSEQ:NP_523417,GB_protein:AAF45370,UniProt/Swiss-Prot:Q9W5Y0,FlyMine:FBpp0070000,modMine:FBpp0070000; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel; 
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN

>FBpp0300206 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0300206; name=Nep3-PB; parent=FBgn0031081,FBtr0307554; dbxref=REFSEQ:NP_001245775,GB_protein:AFH07487,FlyBase:FBpp0300206,FlyBase_Annotation_IDs:CG9565-PB,UniProt/Swiss-Prot:Q9W5Y0; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel; 
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN

>FBpp0300207 type=protein; loc=X:join(19963955..19964071,19964782..19964944,19965006..19965126,19965197..19965511,19965577..19966071,19966183..19967012,19967081..19967223,19967284..19967460); ID=FBpp0300207; name=Nep3-PC; parent=FBgn0031081,FBtr0307555; dbxref=REFSEQ:NP_001245776,GB_protein:AFH07488,FlyBase:FBpp0300207,FlyBase_Annotation_IDs:CG9565-PC,UniProt/Swiss-Prot:Q9W5Y0; MD5=19dfee3c4d8ec74f121a5b5f7f7682e1; length=786; release=r6.33; species=Dmel; 
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSD
ATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMN
QLIIRNVLEKPAKSFKSDAERKAKVYYESCLDADEHMEKLGAKPMNDLLLQIGGWNVTKSGYNVANWTMGHTLKILHNKY >

As you can see the header contains a lot of information, a lot of it useless for what I'm looking for. I need to retrieve a fasta file with the protein ID (FBpp), the gene ID or Parent (FBgn) of the longest protein of each gene.

I appreciate any advice you can give me.

regards

proteome bash python • 1.0k views
ADD COMMENT
0
Entering edit mode

Do you only want the single longest entry for a given fasta file?

If so you can achieve that with:

from Bio import SeqIO
import sys

longest = 0
length = 0

for record in SeqIO.parse(sys.argv[1], 'fasta'):
    if len(record.seq) > length:
        longest = record
        length = len(record.seq)

print(">{}_{}\n{}".format(longest.description, len(longest.seq), str(longest.seq)))
ADD REPLY
0
Entering edit mode

Is this what you are looking for ? johnmarondon11

sed '/>/ s/^>.*parent=\(.*\),.*;.*\sdb.*/>\1/g' test.fa | seqkit fx2tab  | awk '{print $0, length($2)}' | datamash groupby 1 max 3 -f | cut -f1,2 | seqkit tab2fx -w 0
>FBgn0031081
MTRYKQTEFTEDDSSSIGGIQLNEATGHTGMQIRYHTARATWNWRSRNKTEKWLLITTFVMAITIFTLLIVLFTDGGSSDATKHVLHVQPHQKDCPSGNELPCLNKHCIFASSEILKSIDVTVDPCDDFYGYSCNQWIKNNPIPEGKSTWGTFGKLEQMNQLIIRNVLEKPAKSFKSDAERKAKVYYESCLDADEHMEKLGAKPMNDLLLQIGGWNVTKSGYNVANWTMGHTLKILHNKY >

with datamash and seqkit

ADD REPLY
0
Entering edit mode
3.9 years ago
Mensur Dlakic ★ 27k

CD-HIT will do this automatically as part of its redundancy removal procedure. It clusters the sequences according to a given identity cutoff (90% is default), and for each cluster will retain only the longest sequence. Super fast, easy to install and run.

ADD COMMENT

Login before adding your answer.

Traffic: 2870 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6