Merging position of all different CDS of a single gene in one line
5.6 years ago

Hello,

I am trying to detect gene duplication events within a genome. For this, I need the positional information of the CDS of those genes.

The problem is that I need only one line per unique ID, with its positional information.

My file:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  225225
st1 PGSC0003DMC400039465    225786  225990
st1 PGSC0003DMC400039465    226430  226630
st1 PGSC0003DMC400039465    227247  227461
st1 PGSC0003DMC400039465    228093  228346
st1 PGSC0003DMC400039465    228815  228867
st1 PGSC0003DMC400039465    228960  229439
st1 PGSC0003DMC400039540    249208  249402

What I want:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  229439
st1 PGSC0003DMC400039540    249208  249402

Thank you.

CDS GFF gene duplication • 1.8k views

Is your file always sorted like that? That is, are the IDs always grouped together and the coordinates sorted?


No, I sorted my original GFF file using an awk command.


What have you tried? You have a clear idea of what you want, so you must have made some headway into getting there, right?


Can I also add that the tag 'gene duplication' is misplaced here. Those are not gene duplications but exons (CDS) of a single gene. So you just want the beginning and end coordinate of each gene, rather than the separate exons.


Yes, I know that they are not duplicated genes. I am trying to find gene duplications, for which I need to make a GFF file, and for that I have to make a file of CDS with coordinates. You are right, I want the start and end coordinates of each CDS.

Thank you


Please aim for professional communication -

Yes, i know that they are not duplicated genes... i am trying to find gene duplication for which i need to make gff file for that i have to make a file of cds with coordinates.... and u r right, i want start and end coordinate of a CDS.

thankyou

would be:

Yes, I know that they are not duplicated genes. I am trying to find gene duplication for which I need to make gff file, and for that I have to make a file of cds with coordinates. And you are right, I want start and end coordinate of a CDS.

Thank You

5.6 years ago
rjactonspsfcf ▴ 160

There are a number of tools you could use to do that, and I'm not sure which you would prefer, but here is a solution in R using tools from the 'tidyverse':

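library(tidyverse)

# Read the four tab-separated columns (the file has no header row).
# Coordinates are read as integers so that write.table() writes them in full
# rather than in scientific notation (e.g. 2e+05 for 200000).
data <- read_delim("~/Documents/tmp/cds.txt", delim = "\t",
                   col_names = c("contig", "name", "start", "end"),
                   col_types = "ccii")

# get the min start and max end value for each ID
uniqueData <- data %>%
    group_by(contig, name) %>%
    summarise(start = min(start), end = max(end))

# Write one tab-separated line per ID, without header or row names
write.table(uniqueData, file = "~/Documents/tmp/uniqueCDS.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)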
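
For anyone who prefers to stay on the command line, the same min/max-per-ID idea can probably be sketched in awk as well. This is only a sketch under assumptions: the input is tab-separated with the four columns shown in the question, and the file names cds_demo.txt and uniqueCDS_demo.txt are made up for the demo. Unlike summarise(), which returns the groups sorted by contig and name, this version prints each ID in the order it first appears, and it does not need the input to be sorted.

```shell
# Build a small tab-separated demo file (rows taken from the question,
# with one CDS line deliberately out of order).
printf '%s\t%s\t%s\t%s\n' \
    st1 PGSC0003DMC400026563 152418 152576 \
    st1 PGSC0003DMC400026561 160499 160663 \
    st1 PGSC0003DMC400039465 225140 225225 \
    st1 PGSC0003DMC400039465 228960 229439 \
    st1 PGSC0003DMC400039465 225786 225990 \
    st1 PGSC0003DMC400039540 249208 249402 > cds_demo.txt

# Keep the smallest start and largest end seen for each contig/ID pair.
awk 'BEGIN { FS = OFS = "\t" }
{
    key = $1 FS $2
    if (!(key in lo)) {              # first line for this contig/ID pair
        order[++n] = key             # remember first-seen order
        lo[key] = $3
        hi[key] = $4
    } else {
        if ($3 + 0 < lo[key] + 0) lo[key] = $3
        if ($4 + 0 > hi[key] + 0) hi[key] = $4
    }
}
END {
    for (i = 1; i <= n; i++) print order[i], lo[order[i]], hi[order[i]]
}' cds_demo.txt > uniqueCDS_demo.txt

cat uniqueCDS_demo.txt
```

If the result should be in coordinate order instead, piping it through `sort -k1,1 -k3,3n` should do that.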

Thank you, it worked.


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.
