Question: How to retain the row or line with longest protein isoform from csv table
0
gravatar for jamesdong
9 days ago by
jamesdong0
China
jamesdong0 wrote:

I have 5 columns of data in csv format, the data title includes Organism, Accession_number, GeneID, Chromosome,and Protein_length, if different accession_numbers have same geneID, the line or row with longest protein length will be retained. Although there are many items and great answers about selecting longest isoforms: How to extract the longest isoform from multi fasta file how to select longest isoform per gene in gtf How to select the best isoform for a differential expressed gene in Trinity? select longest isoform from RefSeq proteome fasta longest transcript NCBI NM_ids most of them deal with one species. Here I want to retain the longest isoforms of some homologous from different species. For example, all proteins with the same GeneID 1636 in homo sapiens, which I will retain the line or row with longest Protein length of 1306 amino acids from, so does the Mus musculus with GeneID 11421, the line with longest Protein length of 1312 will be preserved, the rest of the table including Rattus norvegicus and Drosophila melanogaster will be kept it as it is, because each of them has unique GeneID. This is the example data:

Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens XP_006721800 1636 17 752
Homo_sapiens NP_690043 1636 17 732
Homo_sapiens NP_001369629 1636 17 1117
Homo_sapiens NP_001369630 1636 17 1022
Homo_sapiens NP_000780 1636 17 1306
Homo_sapiens NP_001171528 1636 17 691
Homo_sapiens NP_001369631 1636 17 511
Mus_musculus NP_033728 11421 11 732
Mus_musculus NP_997507 11421 11 1312
Mus_musculus NP_001268748 11421 11 1249
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611

the final result I want is like:

Organism Accession_number GeneID Chromosome Protein_length
Homo_sapiens NP_000780 1636 17 1306
Mus_musculus NP_997507 11421 11 1312
Rattus_norvegicus NP_036676 24310 10 1313
Drosophila_melanogaster NP_001260258 34189 2L 630
Drosophila_melanogaster NP_001285915 34805 2L 615
Drosophila_melanogaster NP_788042 34806 2L 611

Anyone can help me? Thanks in advance.

ADD COMMENTlink modified 1 day ago • written 9 days ago by jamesdong0
1

Sort decreasingly by Protein_length and then deduplicate by GeneID, so only take the first unique GeneID, this can e.g. by done by sort -u with the proper column in the -k option. Try it out, great exercise to learn.

ADD REPLYlink written 9 days ago by ATpoint36k

Thanks for kind reply. These days I learned the command sort, and I used

sort -k 5nr aa.txt > bb.txt

as your suggested to decrease by Protein_length successfully, however, when I use

sort -k 5 -u bb.txt

no changes happened, is there a problem?

ADD REPLYlink modified 7 days ago • written 7 days ago by jamesdong0
1

use datamash, group by GeneID, print maximum value for 5th column.

ADD REPLYlink modified 9 days ago • written 9 days ago by cpad011213k

Many thanks. I am learning the amazing command datamash, and run it successfully, however, when I use

datamash --sort --header-out groupby 3 max 5 < aa.txt > ab.txt

it ignored other columns, only left the deduplicated column 3 and their conresponding max value columns, I want all the columns there. Any other detailed suggestions? Thanks.

ADD REPLYlink modified 7 days ago • written 7 days ago by jamesdong0
1

try -f option

ADD REPLYlink written 6 days ago by cpad011213k

It is awesome ! I use datamash --sort --header-out groupby 3 max 5 -f < aa.txt > ab.txt it will creat all columns data including deduplicated column 3 and their conresponding max value columns, the header will be retained but changed name to field-1 field-2 field-3 field-4 field-5 max(field-5). Nevertheless, I have already obtained the desired result, thanks!

ADD REPLYlink modified 1 day ago • written 1 day ago by jamesdong0
1
gravatar for ATpoint
7 days ago by
ATpoint36k
Germany
ATpoint36k wrote:

sort -k5,5rn aa.txt | awk '!_[$3]++'

ADD COMMENTlink written 7 days ago by ATpoint36k

It works, it is great answer! Many thanks!

ADD REPLYlink written 6 days ago by jamesdong0

how did you handle the header? @ jamesdong

ADD REPLYlink written 5 days ago by cpad011213k

The header can be handled by other script offered by https://stackoverflow.com/questions/14562423/is-there-a-way-to-ignore-header-lines-in-a-unix-sort and https://stackoverflow.com/questions/21876640/last-line-to-first-line, the piped script is:

cat aa.txt  | awk 'NR<2{print $0;next}{print $0| "sort -r"}' | sort -k5,5rn |\
awk '!_[$3]++' | awk '{a[NR]=$0}END{for(x=1;x<NR;x++){if(x==1)print a[NR];print a[x]}}' > bb.txt
ADD REPLYlink modified 1 day ago • written 1 day ago by jamesdong0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1270 users visited in the last hour