Question: Filter transcripts based on CDS length
0
gravatar for waqaskhokhar999
11 days ago by
waqaskhokhar99970 wrote:

I have a large file contains genome annotation information of Arabidopsis thaliana obtained via Biomart. I want to extract transcripts and corresponding information which have highest CDS length (means those transcripts which codes for longest proteins). The file contains 34 columns and n rows, I am pasting first 7 columns and information of just 2 genes for sake of simplicity:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01030   AT1G01030.2 AT1G01030.2 1008    1   11649   13714
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.1 AT1G01110.1 1095    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737

Now I am interested only those transcripts which have maximum CDS length, so the desired output should be:

Gene stable ID  Transcript stable ID    Protein stable ID   CDS Length  Chromosome  Gene start  Gene end
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01030   AT1G01030.1 AT1G01030.1 1077    1   11649   13714
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
AT1G01110   AT1G01110.2 AT1G01110.2 1584    1   51953   54737
rna-seq annotation • 72 views
ADD COMMENTlink modified 11 days ago by Haci120 • written 11 days ago by waqaskhokhar99970
3
gravatar for Haci
11 days ago by
Haci120
Haci120 wrote:

Whereas the solution offered by @EVR would be much faster, here is the dplyr equivalent:

library(data.table) # for fread()
library(dplyr)

input %>%
  group_by(GenestableID)%>%
  filter(CDSLength == max(CDSLength)) %>%
  ungroup()

# A tibble: 7 x 7
  GenestableID TranscriptstableID ProteinstableID CDSLength Chromosome Genestart Geneend
  <chr>        <chr>              <chr>               <int>      <int>     <int>   <int>
1 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
2 AT1G01030    AT1G01030.1        AT1G01030.1          1077          1     11649   13714
3 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
4 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
5 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
6 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
7 AT1G01110    AT1G01110.2        AT1G01110.2          1584          1     51953   54737
ADD COMMENTlink written 11 days ago by Haci120
2
gravatar for EVR
11 days ago by
EVR560
Earth
EVR560 wrote:

HI,

You can group the data.frame based on genestabelId and later take max(CDS Length) which will provide what u want.

setDT(data.frame)[, .SD[which.max(CDS Length)], by=Gene stable ID]

ADD COMMENTlink written 11 days ago by EVR560
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 882 users visited in the last hour