How to retain the row or line with longest protein isoform from csv table
1
0
Entering edit mode
3.8 years ago
jamesdong • 0

I have 5 columns of data in csv format, the data title includes Organism, Accession_number, GeneID, Chromosome,and Protein_length, if different accession_numbers have same geneID, the line or row with longest protein length will be retained. Although there are many items and great answers about selecting longest isoforms: How to extract the longest isoform from multi fasta file how to select longest isoform per gene in gtf How to select the best isoform for a differential expressed gene in Trinity? select longest isoform from RefSeq proteome fasta longest transcript NCBI NM_ids most of them deal with one species. Here I want to retain the longest isoforms of some homologous from different species. For example, all proteins with the same GeneID 1636 in homo sapiens, which I will retain the line or row with longest Protein length of 1306 amino acids from, so does the Mus musculus with GeneID 11421, the line with longest Protein length of 1312 will be preserved, the rest of the table including Rattus norvegicus and Drosophila melanogaster will be kept it as it is, because each of them has unique GeneID. This is the example data:

Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens XP_006721800 1636 17 752
Homo_sapiens NP_690043 1636 17 732
Homo_sapiens NP_001369629 1636 17 1117
Homo_sapiens NP_001369630 1636 17 1022
Homo_sapiens NP_000780 1636 17 1306
Homo_sapiens NP_001171528 1636 17 691
Homo_sapiens NP_001369631 1636 17 511
Mus_musculus NP_033728 11421 11 732
Mus_musculus NP_997507 11421 11 1312
Mus_musculus NP_001268748 11421 11 1249
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611

the final result I want is like:

Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens NP_000780 1636 17 1306
Mus_musculus NP_997507 11421 11 1312
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611

Anyone can help me? Thanks in advance.

csv protein isoform python linux command • 1.7k views
ADD COMMENT
1
Entering edit mode

Sort decreasingly by Protein_length and then deduplicate by GeneID, so only take the first unique GeneID, this can e.g. by done by sort -u with the proper column in the -k option. Try it out, great exercise to learn.

ADD REPLY
0
Entering edit mode

Thanks for kind reply. These days I learned the command sort, and I used

sort -k 5nr aa.txt > bb.txt

as your suggested to decrease by Protein_length successfully, however, when I use

sort -k 5 -u bb.txt

no changes happened, is there a problem?

ADD REPLY
1
Entering edit mode

use datamash, group by GeneID, print maximum value for 5th column.

ADD REPLY
0
Entering edit mode

Many thanks. I am learning the amazing command datamash, and run it successfully, however, when I use

datamash --sort --header-out groupby 3 max 5 < aa.txt > ab.txt

it ignored other columns, only left the deduplicated column 3 and their conresponding max value columns, I want all the columns there. Any other detailed suggestions? Thanks.

ADD REPLY
1
Entering edit mode

try -f option

ADD REPLY
0
Entering edit mode

It is awesome ! I use datamash --sort --header-out groupby 3 max 5 -f < aa.txt > ab.txt it will creat all columns data including deduplicated column 3 and their conresponding max value columns, the header will be retained but changed name to field-1 field-2 field-3 field-4 field-5 max(field-5). Nevertheless, I have already obtained the desired result, thanks!

ADD REPLY
1
Entering edit mode
3.8 years ago
ATpoint 82k

sort -k5,5rn aa.txt | awk '!_[$3]++'

ADD COMMENT
0
Entering edit mode

It works, it is great answer! Many thanks!

ADD REPLY
0
Entering edit mode

how did you handle the header? @ jamesdong

ADD REPLY
0
Entering edit mode

The header can be handled by other script offered by https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort and https://stackoverflow.com/questions/21876640/last-line-to-first-line, the piped script is:

cat aa.txt  | awk 'NR<2{print $0;next}{print $0| "sort -r"}' | sort -k5,5rn |\
awk '!_[$3]++' | awk '{a[NR]=$0}END{for(x=1;x<NR;x++){if(x==1)print a[NR];print a[x]}}' > bb.txt
ADD REPLY

Login before adding your answer.

Traffic: 2532 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6