Filter transcripts based on CDS length
2
0
Entering edit mode
4.6 years ago

I have a large file contains genome annotation information of Arabidopsis thaliana obtained via Biomart. I want to extract transcripts and corresponding information which have highest CDS length (means those transcripts which codes for longest proteins). The file contains 34 columns and n rows, I am pasting first 7 columns and information of just 2 genes for sake of simplicity:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737

Now I am interested only those transcripts which have maximum CDS length, so the desired output should be:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
RNA-Seq annotation • 1.4k views
ADD COMMENT
3
Entering edit mode
4.6 years ago
Haci ▴ 680

Whereas the solution offered by @EVR would be much faster, here is the dplyr equivalent:

library(data.table) # for fread()
library(dplyr)

input %>%
  group_by(GenestableID)%>%
  filter(CDSLength == max(CDSLength)) %>%
  ungroup()

# A tibble: 7 x 7
  GenestableID TranscriptstableID ProteinstableID CDSLength Chromosome Genestart Geneend
  <chr>        <chr>              <chr>               <int>      <int>     <int>   <int>
1 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
2 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
3 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
4 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
5 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
6 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
7 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
ADD COMMENT
2
Entering edit mode
4.6 years ago
EVR ▴ 610

HI,

You can group the data.frame based on genestabelId and later take max(CDS Length) which will provide what u want.

setDT(data.frame)[, .SD[which.max(CDS Length)], by=Gene stable ID]

ADD COMMENT

Login before adding your answer.

Traffic: 2013 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6