Question: Merging position of all different CDS of a single gene in one line
1234anjalianjali1234 (India), 15 months ago, wrote:

Hello,

I am looking for gene duplication events within a genome. For this, I need the positional information of the CDS of those genes.

The problem is that I need each unique ID only once, with its positional information.

My file:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  225225
st1 PGSC0003DMC400039465    225786  225990
st1 PGSC0003DMC400039465    226430  226630
st1 PGSC0003DMC400039465    227247  227461
st1 PGSC0003DMC400039465    228093  228346
st1 PGSC0003DMC400039465    228815  228867
st1 PGSC0003DMC400039465    228960  229439
st1 PGSC0003DMC400039540    249208  249402

What I want:

st1 PGSC0003DMC400026563    152418  152576
st1 PGSC0003DMC400026561    160499  160663
st1 PGSC0003DMC400039465    225140  229439
st1 PGSC0003DMC400039540    249208  249402

Thank you.

Tags: gene duplication, gff, cds

Is your file always sorted like that? That is, are the IDs always grouped together and the coordinates sorted?

written 15 months ago by Devon Ryan (93k)

No, I sorted my original GFF file using an awk command.

written 15 months ago by 1234anjalianjali1234 (30)

What have you tried? You have a clear idea of what you want, so you must have made some headway into getting there, right?

written 15 months ago by RamRS (25k)

Can I also add that the tag 'gene duplication' is misplaced here. Those are not gene duplications but exons (CDS) of a single gene. So you just want the beginning and end coordinate of each gene, rather than the separate exons.

modified 15 months ago • written 15 months ago by lieven.sterck (6.4k)

Yes, I know that they are not duplicated genes. I am trying to find gene duplications, for which I need to make a GFF file, and for that I have to make a file of CDS with coordinates. You are right, I want the start and end coordinates of a CDS.

Thank you

modified 15 months ago • written 15 months ago by 1234anjalianjali1234 (30)

Please aim for professional communication -

Yes, i know that they are not duplicated genes... i am trying to find gene duplication for which i need to make gff file for that i have to make a file of cds with coordinates.... and u r right, i want start and end coordinate of a CDS.

thankyou

would be:

Yes, I know that they are not duplicated genes. I am trying to find gene duplication for which I need to make gff file, and for that I have to make a file of cds with coordinates. And you are right, I want start and end coordinate of a CDS.

Thank You

written 15 months ago by RamRS (25k)

rjactonspsfcf (Southampton), 15 months ago, wrote:

There are a number of tools you could use to do that, and I'm not sure which you would prefer, but here is a solution in R using tools from the 'tidyverse':

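library(tidyverse)

# read the four-column, tab-delimited CDS table (the file has no header row)
data <- read_delim("~/Documents/tmp/cds.txt", delim = "\t",
                   col_names = c("contig", "name", "start", "end"))

# get the min start and max end value for each ID
uniqueData <- data %>%
    group_by(contig, name) %>%
    summarise(start = min(start), end = max(end), .groups = "drop") %>%
    arrange(contig, start)  # summarise() orders rows by group key; sort back by position

# write the result as tab-separated text without header or quotes
write.table(uniqueData, file = "~/Documents/tmp/uniqueCDS.txt", sep = "\t",
            row.names = FALSE, col.names = FALSE, quote = FALSE)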
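
For anyone who would rather not use R, the same idea (minimum start and maximum end per ID) can be sketched in plain Python. This is only an illustration, not a tested pipeline: the function names `merge_cds` and `merge_cds_file` are made up for this example, and it assumes four whitespace-separated columns (contig, ID, start, end) as in the question. It does not require the input to be sorted or grouped.

```python
def merge_cds(rows):
    """Collapse CDS rows into one (contig, id, min_start, max_end) per ID.

    `rows` is any iterable of 4-item sequences: contig, id, start, end.
    Output keeps the order in which each (contig, id) pair first appears.
    """
    spans = {}  # (contig, id) -> [min_start, max_end]; dicts preserve insertion order
    for contig, name, start, end in rows:
        start, end = int(start), int(end)
        key = (contig, name)
        if key in spans:
            spans[key][0] = min(spans[key][0], start)
            spans[key][1] = max(spans[key][1], end)
        else:
            spans[key] = [start, end]
    return [(c, n, s, e) for (c, n), (s, e) in spans.items()]


def merge_cds_file(in_path, out_path):
    """Hypothetical helper: read a whitespace-delimited file, write a tab-delimited one."""
    with open(in_path) as src:
        # split() with no argument handles both tabs and runs of spaces
        rows = [line.split() for line in src if line.strip()]
    with open(out_path, "w") as dst:
        for record in merge_cds(rows):
            dst.write("\t".join(map(str, record)) + "\n")
```

Calling `merge_cds_file("cds.txt", "uniqueCDS.txt")` (both file names are placeholders) would write one line per ID. Note that this keeps first-appearance order rather than sorting by coordinate, so sort the input or the output if positional order matters.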

Thank you, it worked.

written 15 months ago by 1234anjalianjali1234 (30)

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.

written 15 months ago by RamRS (25k)