Cluster of neighboring genes by index (Looking for linux shell Script)
1
0
Entering edit mode
2.1 years ago

Hi, everyone

I have a dataset like this .

1st column has protein Ids of different organisms and 2nd column has domain names.

Protein Ids domain
Abiotrophia_defectiva_peg_0144  wzz
Abiotrophia_defectiva_peg_0198  wxy
Abiotrophia_defectiva_peg_0200  wzz
Abiotrophia_defectiva_peg_0215  wca
Abyssicoccus_albus_peg_1185 wzz
Abyssicoccus_albus_peg_1189 wzx
Abyssicoccus_albus_peg_1200 wza
Abyssicoccus_albus_peg_1322 wca
Abyssicoccus_albus_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa
Bradyrhizobium_elkanii_peg_6752 wca
Bradyrhizobium_elkanii_peg_6780 wvx

I want to know which proteins are near by

means according to "peg" numbers if I say protein numbers coming under +/- 5 they are near by and they form cluster.

Output should look like :

Abiotrophia_defectiva_peg_0198 wxy
Abiotrophia_defectiva_peg_0200 wzz
Abyssicoccus_albus_peg_1185 wzz
Abyssicoccus_albus_peg_1189 wzx
Abyssicoccus_albus_peg_1322 wca
Abyssicoccus_albus_peg_1324 wbb
Bradyrhizobium_elkanii_peg_6717 wac
Bradyrhizobium_elkanii_peg_6718 wzx
Bradyrhizobium_elkanii_peg_6721 waa

Is there any way I can do this task for my data? I can do manually in excel sheet but my dataset is very large. So I need some script for this

Please do let me know

Thanks

shell-scripting linux • 1.3k views
ADD COMMENT
1
Entering edit mode

So the desired output is to filter and retain only those genes that are at a distance of +-5 genes from any other gene within the same organism?

While the task at hand is a relatively easy one, a word of caution: Are you sure your protein-encoding genes are always numbered in the correct way? While it might be often the case, it may not be fully consistent. Some genes might be on plasmids, then the numbering breaks and you will get false positives, so if possible also include the replicon id. Also, bacterial replicons are circular, then you might miss out on a few occasions where the cluster spans the origin.

Also, we need to know how many genes there are per organism on average, if there are too many, we get problems with the distance calculation and should rather use interval overlap .

ADD REPLY
0
Entering edit mode

Hi, yes my protein dataset is sequencial in order. I have 3982 organisms and 72553 proteins total. I want those clusters to be made according to organisms . LIke I have mentioned in the desired output. And yes you are absolutely right. but currently I am not concerned about the plasmid part. even though the sequence breaks thats fine . I just want to make a cluster of near by proteins thats already in my dataset .

ADD REPLY
0
Entering edit mode

Ok, so that's ~18 proteins per organism = ~165 comparisons per org. and a total of approximately 660,000 operations. That might be possible with a naive implementation. I can give you a perl script that does that.

ADD REPLY
0
Entering edit mode

HI, Sorry to bother you again.

but can you modify the script accordingly .

I have a dataset like this .

1st coloumn has protein Ids of different organisms and 2nd column has domain names.

Protein Ids domain

Abiotrophia_defectiva_peg_0144 wzz

Abiotrophia_defectiva_peg_0198 wxy

Abiotrophia_defectiva_peg_0200 wzz

Abiotrophia_defectiva_peg_0215 wca

Abyssicoccus_albus_123_peg_1185 wzz

Abyssicoccus_albus_123_peg_1189 wzx

Abyssicoccus_albus_123_peg_1200 wza

Abyssicoccus_albus_123_peg_1322 wca

Abyssicoccus_albus_123_peg_1324 wbb

Bradyrhizobium_elkanii_peg_6717 wac

Bradyrhizobium_elkanii_peg_6718 wzx

Bradyrhizobium_elkanii_peg_6721 waa

Bradyrhizobium_elkanii_peg_6752 wca

Bradyrhizobium_elkanii_peg_6780 wvx

I want to know whic proteins are near by . means according to "peg" numbers if I say protein numbers coming under +/- 5 they are near by and they form cluster. Please keep +/-5 flexible. I might need +/-10 later.

output should look like :

Abiotrophia_defectiva_peg_0198 wxy

Abiotrophia_defectiva_peg_0200 wzz


Abyssicoccus_albus_123_peg_1185 wzz

Abyssicoccus_albus_123_peg_1189 wzx


Abyssicoccus_albus_123_peg_1322 wca

Abyssicoccus_albus_123_peg_1324 wbb


Bradyrhizobium_elkanii_peg_6717 wac

Bradyrhizobium_elkanii_peg_6718 wzx

Bradyrhizobium_elkanii_peg_6721 waa

there should be separating line for each cluster

ADD REPLY
1
Entering edit mode
2.1 years ago
Michael 54k

Here's the script. Just needs Perl 5.

ADD COMMENT
0
Entering edit mode

hi, thank you for this script it works. BUT I have a liitle query. i have some proteins that are named like this Bacillus_subtilis_168_peg_0057 wzz

Bacillus_subtilis_168_peg_0274 Wzy_C

Bacillus_subtilis_168_peg_0749 Polysacc_synt_2

Bacillus_subtilis_168_peg_1058 Polysacc_synt_2

Bacillus_subtilis_168_peg_1100 Polysacc_synt_C

Bacillus_subtilis_168_peg_1422 Polysacc_synt Polysacc_synt_C

Bacillus_subtilis_168_peg_1888 wxx

so for this the script is showing error and telling invalid index peg its because the digits before the peg digit . How can I solve this issue . Please do let me know.

ADD REPLY
2
Entering edit mode

Problem is that data you posted is not representative data. So the code fails to accommodate patterns of the original data. No code will be useful if you keep on changing data patterns. and it would be huge effort for any one who works on such data.

ADD REPLY
1
Entering edit mode

As cpad says, it is difficult to hit moving targets... Anyway, I have adapted the script to suit the updated example. If you need more adjustments you either have to post the full dataset or you will be on your own.

ADD REPLY
0
Entering edit mode

I have a dataset like this .

1st coloumn has protein Ids of different organisms and 2nd column has domain names.

Protein Ids domain

Abiotrophia_defectiva_peg_0144 wzz

Abiotrophia_defectiva_peg_0198 wxy

Abiotrophia_defectiva_peg_0200 wzz

Abiotrophia_defectiva_peg_0215 wca

Abyssicoccus_albus_123_peg_1185 wzz

Abyssicoccus_albus_123_peg_1189 wzx

Abyssicoccus_albus_123_peg_1200 wza

Abyssicoccus_albus_123_peg_1322 wca

Abyssicoccus_albus_123_peg_1324 wbb

Bradyrhizobium_elkanii_peg_6717 wac

Bradyrhizobium_elkanii_peg_6718 wzx

Bradyrhizobium_elkanii_peg_6721 waa

Bradyrhizobium_elkanii_peg_6752 wca

Bradyrhizobium_elkanii_peg_6780 wvx

I want to know whic proteins are near by . means according to "peg" numbers if I say protein numbers coming under +/- 5 they are near by and they form cluster.

output should look like :

Abiotrophia_defectiva_peg_0198 wxy

Abiotrophia_defectiva_peg_0200 wzz


Abyssicoccus_albus_123_peg_1185 wzz

Abyssicoccus_albus_123_peg_1189 wzx

Abyssicoccus_albus_123_peg_1322 wca

Abyssicoccus_albus_123_peg_1324 wbb


Bradyrhizobium_elkanii_peg_6717 wac

Bradyrhizobium_elkanii_peg_6718 wzx

Bradyrhizobium_elkanii_peg_6721 waa


there should be separating line for each cluster

Is there any way I can do this task for my data . I need some script for this .

Please do let me know . Thanks

ADD REPLY
1
Entering edit mode
library(dplyr)
library(tidyr)
library(stringr)
> df %>% 
+   separate(Protein.Ids,into = c("sp","peg"), sep = "peg_", convert=T) %>% 
+   group_by(sp) %>% 
+   mutate(state=if_else(peg-lag(peg)<=5,"true","false"),
+          state=ifelse((lead(state)=="true"| state=="true"),"true",state)) %>% 
+   filter(state=="true") %>% 
+   mutate(peg = sprintf("%05d", peg)) %>% 
+   unite("Protein_ID", sp,peg, sep ="peg_") %>% 
+   select(-state)


# A tibble: 9 × 2
  Protein_ID                       domain
  <chr>                            <chr> 
1 Abiotrophia_defectiva_peg_00198  wxy   
2 Abiotrophia_defectiva_peg_00200  wzz   
3 Abyssicoccus_albus_123_peg_01185 wzz   
4 Abyssicoccus_albus_123_peg_01189 wzx   
5 Abyssicoccus_albus_123_peg_01322 wca   
6 Abyssicoccus_albus_123_peg_01324 wbb   
7 Bradyrhizobium_elkanii_peg_06717 wac   
8 Bradyrhizobium_elkanii_peg_06718 wzx   
9 Bradyrhizobium_elkanii_peg_06721 waa  

Underlines under each category may not be possible with this code and also do not use special characters and spaces in column names.

ADD REPLY

Login before adding your answer.

Traffic: 1731 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6