Longest Protein per Gene from gpff file
8 weeks ago
reza ▴ 300

I have a file in gpff (GenPept flat file) format containing protein sequences downloaded from NCBI. Some genes have several protein isoforms. How can I extract the longest protein per gene? I want a protein for every gene, including genes that have only one protein version.

Part of the .gpff file:

LOCUS       NP_001347130            1069 aa            linear   PLN 30-JAN-2018
DEFINITION  uncharacterized protein LOC111828501 [Oryza sativa Japonica Group].
ACCESSION   NP_001347130
VERSION     NP_001347130.1
DBSOURCE    REFSEQ: accession NM_001360201.1
KEYWORDS    RefSeq.
SOURCE      Oryza sativa Japonica Group (Japanese rice)
  ORGANISM  Oryza sativa Japonica Group
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Poales; Poaceae; BOP
            clade; Oryzoideae; Oryzeae; Oryzinae; Oryza; Oryza sativa.
COMMENT     VALIDATED REFSEQ: This record has undergone validation or
            preliminary review. The reference sequence was derived from
            AK064183.1 and AP014959.1.
FEATURES             Location/Qualifiers
     source          1..1069
                     /organism="Oryza sativa Japonica Group"
                     /db_xref="taxon:39947"
     Protein         1..1069
                     /product="uncharacterized protein LOC111828501"
                     /calculated_mol_wt=121529
     CDS             1..1069
                     /gene="LOC111828501"
                     /coded_by="NM_001360201.1:90..3299"
                     /db_xref="GeneID:111828501"
ORIGIN      
        1 madpedaaaa aaagneddve dlyadlddqv aaalaaages ggsnpatdge aeaeapgahh
       61 teadaneavd lgdgtagyis sdeeseddlh ivlnedgaap pppppagrce egseegevsg
      121 scvkglstdg grgklgelhr kglfekttap itgqgdrshq hafqkefnff lprnrtvfdv
      181 dieafqekpw rqhgvdltdy fnfgldeesw rkycfdmehf rhgtrtlane lsglqqefhy
      241 nlglsksvpk seiysvlkeg ngiakpkgra ihveggmher lpsadmwppr qrdsdviqvn
      301 mmfppsnrss sddrstvndk cittkrcgps nnhpgvdeyl ketssvvdrv vdkevhkrgs
      361 sectrsktvl gdsacagaqs stpdnsdmls eestedfhfk rkrgksnsna fyvetnrkde
      421 hvlsdfcrha sksdqesskg eshrytpspa ddryhkatkr qrmdeagaci ssrslnncqs
      481 dhhlhesghr akkelkrqsl aggkhalfer qenttdnyss ryarkhkhkr ssstflgtny
      541 rvhnqlcekq eylplgraal rndeqcsady nqrhrrswre inddedivgc ysarrwqqrh
      601 ddlhgshsml kaevcddidg hmyrerryee trkirhdrng ddeffhytdy rfgkvldped
      661 rrrcrsqsae scdehfrrse hlvfdhfthp dqlmlshqan dnhrksekgw pgpaasltfm
      721 rsrnrfidne riqngkmkyn hdgyyekkrq hdsvfdvddi qqpalytgsv aetgqcirpv
      781 krrvhadhsm nrkdrfnssy qkgrrlmhgw smisdrdlyv aemhnspkdi dveamcspnd
      841 mrnsnnipni ydkirhevvn lqprdtdnml lihrkrkfkr qgieirrvve sdsegclpad
      901 sdlhgskhkn ihqkvrkpra frisrnqase kseqqkqqhv snnqeyeeie egelieqdhq
      961 dtasrsksnh qrkvvlksvi eassacqggv inatskdadc sngatgecdn khilevmkkm
     1021 qkrserfkas iatqkeeded rkeslavtcd vddiknqrpa rkrlwgcsg
8 weeks ago

Have a look at AGAT, with special focus on the agat_sp_keep_longest_isoform.pl script.

You might first need to convert the GenBank file to GFF (or GTF), but AGAT has functionality for that as well.
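
If you would rather work on the .gpff file directly, here is a minimal Python sketch, not the AGAT approach. It assumes every record starts with a LOCUS line, carries its gene in a /db_xref="GeneID:..." or /gene="..." qualifier (as in the record shown in the question), and stores the sequence under ORIGIN. The function names parse_gpff and longest_per_gene are made up for illustration. Records without any gene qualifier are keyed by their own accession, so they are always kept, and ties in length keep the first record seen.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (accession, gene key, sequence)


def parse_gpff(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (accession, gene key, sequence) for each LOCUS record."""
    acc, gene, seq, in_origin = None, None, [], False
    for line in lines:
        if line.startswith("LOCUS"):
            # A new record begins: emit the previous one, if any.
            if acc is not None:
                yield acc, gene or acc, "".join(seq)
            acc, gene, seq, in_origin = line.split()[1], None, [], False
        elif acc is None:
            continue  # skip anything before the first LOCUS line
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            in_origin = False
        elif in_origin:
            # Sequence lines look like "   61 teadaneavd lgdgtagyis ..."
            seq.append(re.sub(r"[^A-Za-z]", "", line).upper())
        else:
            # Assumption: GeneID is preferred as the gene key, /gene is the fallback.
            m = re.search(r'/db_xref="GeneID:(\d+)"', line)
            if m:
                gene = "GeneID:" + m.group(1)
            elif gene is None:
                m = re.search(r'/gene="([^"]+)"', line)
                if m:
                    gene = m.group(1)
    if acc is not None:
        yield acc, gene or acc, "".join(seq)


def longest_per_gene(records: Iterable[Record]) -> Dict[str, Tuple[str, str]]:
    """Map each gene key to the (accession, sequence) of its longest protein."""
    best: Dict[str, Tuple[str, str]] = {}
    for acc, gene, seq in records:
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (acc, seq)
    return best
```

To use it, open the .gpff file, pass the file handle to parse_gpff, feed the result to longest_per_gene, and write each (accession, sequence) pair out as a FASTA entry. Check the output against a few genes you know have multiple isoforms before relying on it.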
