Extract longest transcript from Ensembl gene table
1
0
Entering edit mode
10 months ago
Midge • 0

Hello all, I need help to select rows from gene table, selecting only longest transcript when multiple are available. For each unique gene ID (gene), I need to extract the row with the longest transcript (length). I would like to keep the whole row because I need the ID of the protein and coordinates for further analysis. See below for a sample of my table, with data extracted from Ensembl BioMart. For example for ENSGACG00000000027.1 i would keep only this row with biggest length (3024) (ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024).

The whole table has about 60K lines so I could really appreciate some help sorting these. I have been trying to find this out but similar questions focus on processing fasta files.

' gene  protein chr start   end length
ENSGACG00000000022.1    ENSGACP00000000030.1    scaffold_89 230109  238032  2415
ENSGACG00000000022.1    ENSGACP00000000031.1    scaffold_89 230109  238032  1284
ENSGACG00000000023.1    ENSGACP00000000032.1    scaffold_89 240752  246187  864
ENSGACG00000000024.1    ENSGACP00000000033.1    scaffold_89 263731  273624  3684
ENSGACG00000000025.1    ENSGACP00000000034.1    scaffold_89 275261  277377  780
ENSGACG00000000026.1    ENSGACP00000000035.1    scaffold_1045   508 2074    678
ENSGACG00000000026.1    ENSGACP00000000036.1    scaffold_1045   508 2074    810
ENSGACG00000000027.1    ENSGACP00000000037.1    scaffold_89 278309  290469  1289
ENSGACG00000000027.1    ENSGACP00000000038.1    scaffold_89 278309  290469  1305
ENSGACG00000000027.1    ENSGACP00000000041.1    scaffold_89 278309  290469  3024

bash gene ensembl R • 337 views
1
Entering edit mode

is this what you are looking for?

\$ datamash -sHf -g 1 max 6 < table.txt

gene    protein chr start   end length  max(length)
ENSGACG00000000022.1    ENSGACP00000000030.1    scaffold_89 230109  238032  2415    2415
ENSGACG00000000023.1    ENSGACP00000000032.1    scaffold_89 240752  246187  864 864
ENSGACG00000000024.1    ENSGACP00000000033.1    scaffold_89 263731  273624  3684    3684
ENSGACG00000000025.1    ENSGACP00000000034.1    scaffold_89 275261  277377  780 780
ENSGACG00000000026.1    ENSGACP00000000036.1    scaffold_1045   508 2074    810 810
ENSGACG00000000027.1    ENSGACP00000000041.1    scaffold_89 278309  290469  3024    3024

0
Entering edit mode

Thanks that works well and very fast even for the big files. One question though. When I save the output in a file, like below, and open it in excel, the last column (new column after datamash), shows up as an extra row under each original row. This doesnt seem to be an issue when I grep rows or columns through the command line. But wondering if there's better way to save stdout. There was no info in the datamash help page about this

datamash -sHf -g 1 max 6 < input.txt > output.txt

0
Entering edit mode

I checked the output on windows 10 with excel (office) 2019 home and student version and not able to reproduce the issue you are reporting. Open the file in an text editor and see if there are any issues. It would also help if you could post a screenshot.

0
Entering edit mode

Are you open to all programming languages/methods to get this? Are you after the longest in terms of cDNA length, protein length or genomic length (including introns)?

0
Entering edit mode

I m happy to try most programming solutions, preferable bash or R, but could also try python/perl if you have it. Would that work?

0
Entering edit mode

The principle with these kinds of selections is always the same. You sort this file by decreasing length (here that would be column 6), so longer entries rank higher, and then starting from top to bottom you only keep the unique gene names. Since the file is sorted by length you will for every gene (or whatever you select) pick up the longest per unique name. Please try something, this is great learning experience.

0
Entering edit mode
10 months ago