I have 5 columns of data in csv format, the data title includes Organism, Accession_number, GeneID, Chromosome,and Protein_length, if different accession_numbers have same geneID, the line or row with longest protein length will be retained. Although there are many items and great answers about selecting longest isoforms: How to extract the longest isoform from multi fasta file how to select longest isoform per gene in gtf How to select the best isoform for a differential expressed gene in Trinity? select longest isoform from RefSeq proteome fasta longest transcript NCBI NM_ids most of them deal with one species. Here I want to retain the longest isoforms of some homologous from different species. For example, all proteins with the same GeneID 1636 in homo sapiens, which I will retain the line or row with longest Protein length of 1306 amino acids from, so does the Mus musculus with GeneID 11421, the line with longest Protein length of 1312 will be preserved, the rest of the table including Rattus norvegicus and Drosophila melanogaster will be kept it as it is, because each of them has unique GeneID. This is the example data:
Organism Accession_number GeneID Chromosome Protein_length Homo_sapiens XP_006721800 1636 17 752 Homo_sapiens NP_690043 1636 17 732 Homo_sapiens NP_001369629 1636 17 1117 Homo_sapiens NP_001369630 1636 17 1022 Homo_sapiens NP_000780 1636 17 1306 Homo_sapiens NP_001171528 1636 17 691 Homo_sapiens NP_001369631 1636 17 511 Mus_musculus NP_033728 11421 11 732 Mus_musculus NP_997507 11421 11 1312 Mus_musculus NP_001268748 11421 11 1249 Rattus_norvegicus NP_036676 24310 10 1313 Drosophila_melanogaster NP_001260258 34189 2L 630 Drosophila_melanogaster NP_001285915 34805 2L 615 Drosophila_melanogaster NP_788042 34806 2L 611
the final result I want is like:
Organism Accession_number GeneID Chromosome Protein_length Homo_sapiens NP_000780 1636 17 1306 Mus_musculus NP_997507 11421 11 1312 Rattus_norvegicus NP_036676 24310 10 1313 Drosophila_melanogaster NP_001260258 34189 2L 630 Drosophila_melanogaster NP_001285915 34805 2L 615 Drosophila_melanogaster NP_788042 34806 2L 611
Anyone can help me? Thanks in advance.