8 weeks ago
reza
I have a file in GPFF (GenPept flat file) format containing protein sequences, downloaded from NCBI. Some genes have several protein isoforms. How can I extract the longest protein per gene? I want one protein for every gene, including genes that have only a single protein.
Part of the .gpff file:
LOCUS NP_001347130 1069 aa linear PLN 30-JAN-2018
DEFINITION uncharacterized protein LOC111828501 [Oryza sativa Japonica Group].
ACCESSION NP_001347130
VERSION NP_001347130.1
DBSOURCE REFSEQ: accession NM_001360201.1
KEYWORDS RefSeq.
SOURCE Oryza sativa Japonica Group (Japanese rice)
ORGANISM Oryza sativa Japonica Group
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; Liliopsida; Poales; Poaceae; BOP
clade; Oryzoideae; Oryzeae; Oryzinae; Oryza; Oryza sativa.
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
AK064183.1 and AP014959.1.
FEATURES Location/Qualifiers
source 1..1069
/organism="Oryza sativa Japonica Group"
/db_xref="taxon:39947"
Protein 1..1069
/product="uncharacterized protein LOC111828501"
/calculated_mol_wt=121529
CDS 1..1069
/gene="LOC111828501"
/coded_by="NM_001360201.1:90..3299"
/db_xref="GeneID:111828501"
ORIGIN
1 madpedaaaa aaagneddve dlyadlddqv aaalaaages ggsnpatdge aeaeapgahh
61 teadaneavd lgdgtagyis sdeeseddlh ivlnedgaap pppppagrce egseegevsg
121 scvkglstdg grgklgelhr kglfekttap itgqgdrshq hafqkefnff lprnrtvfdv
181 dieafqekpw rqhgvdltdy fnfgldeesw rkycfdmehf rhgtrtlane lsglqqefhy
241 nlglsksvpk seiysvlkeg ngiakpkgra ihveggmher lpsadmwppr qrdsdviqvn
301 mmfppsnrss sddrstvndk cittkrcgps nnhpgvdeyl ketssvvdrv vdkevhkrgs
361 sectrsktvl gdsacagaqs stpdnsdmls eestedfhfk rkrgksnsna fyvetnrkde
421 hvlsdfcrha sksdqesskg eshrytpspa ddryhkatkr qrmdeagaci ssrslnncqs
481 dhhlhesghr akkelkrqsl aggkhalfer qenttdnyss ryarkhkhkr ssstflgtny
541 rvhnqlcekq eylplgraal rndeqcsady nqrhrrswre inddedivgc ysarrwqqrh
601 ddlhgshsml kaevcddidg hmyrerryee trkirhdrng ddeffhytdy rfgkvldped
661 rrrcrsqsae scdehfrrse hlvfdhfthp dqlmlshqan dnhrksekgw pgpaasltfm
721 rsrnrfidne riqngkmkyn hdgyyekkrq hdsvfdvddi qqpalytgsv aetgqcirpv
781 krrvhadhsm nrkdrfnssy qkgrrlmhgw smisdrdlyv aemhnspkdi dveamcspnd
841 mrnsnnipni ydkirhevvn lqprdtdnml lihrkrkfkr qgieirrvve sdsegclpad
901 sdlhgskhkn ihqkvrkpra frisrnqase kseqqkqqhv snnqeyeeie egelieqdhq
961 dtasrsksnh qrkvvlksvi eassacqggv inatskdadc sngatgecdn khilevmkkm
1021 qkrserfkas iatqkeeded rkeslavtcd vddiknqrpa rkrlwgcsg
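
To make the goal concrete, here is a minimal sketch of the selection I have in mind. It uses only the Python standard library and assumes each record starts with a `LOCUS` line, carries the gene name in a `/gene="..."` qualifier, and lists the sequence after `ORIGIN`, as in the record above. The function names `parse_gpff` and `longest_per_gene` and the three demo records are hypothetical, made up for illustration. Biopython's `SeqIO.parse(handle, "genbank")` should also be able to read this format, but I have not relied on it here.

```python
import re

def parse_gpff(text):
    """Yield (accession, gene, sequence) for each record in GenPept text."""
    acc, gene, seq, in_origin = None, None, [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("LOCUS "):
            # A new record begins: emit the previous one, if any.
            if acc is not None:
                yield acc, gene, "".join(seq)
            acc, gene, seq, in_origin = stripped.split()[1], None, [], False
        elif acc is None:
            continue  # skip anything before the first LOCUS line
        elif stripped.startswith("ORIGIN"):
            in_origin = True
        elif stripped == "//":
            in_origin = False  # record terminator
        elif in_origin:
            # Sequence lines look like "61 teadaneavd lgdgtagyis ...":
            # drop the position number and the spaces.
            seq.append(re.sub(r"[^A-Za-z]", "", stripped).upper())
        elif gene is None:
            m = re.match(r'/gene="([^"]+)"', stripped)
            if m:
                gene = m.group(1)
    if acc is not None:
        yield acc, gene, "".join(seq)

def longest_per_gene(records):
    """Map each gene to the (accession, sequence) of its longest protein."""
    best = {}
    for acc, gene, seq in records:
        key = gene or acc  # assumption: records without /gene are kept as-is
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (acc, seq)
    return best

# Hypothetical demo input: two isoforms of GENE_A and one protein of GENE_B.
demo = """\
LOCUS       XP_000001   5 aa   linear   PLN
FEATURES             Location/Qualifiers
     CDS             1..5
                     /gene="GENE_A"
ORIGIN
        1 makle
//
LOCUS       XP_000002   8 aa   linear   PLN
FEATURES             Location/Qualifiers
     CDS             1..8
                     /gene="GENE_A"
ORIGIN
        1 makleqrs
//
LOCUS       XP_000003   4 aa   linear   PLN
FEATURES             Location/Qualifiers
     CDS             1..4
                     /gene="GENE_B"
ORIGIN
        1 mtty
//
"""

best = longest_per_gene(parse_gpff(demo))
for gene, (acc, seq) in best.items():
    # Write the selection in FASTA form.
    print(f">{acc} gene={gene} len={len(seq)}\n{seq}")
```

When two isoforms of a gene have the same length, this sketch keeps the first one encountered. For a real file, the text would come from something like `open("proteins.gpff").read()`, where the filename is a placeholder.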