create a gene2accession file from a 2 column file
7.8 years ago
Illinu ▴ 110

I need to create a gene2accession file in which one gene ID may have one or more UniProt entries. My file looks like this:

  comp100002_c0 Q9FFI3
  comp100004_c0 B9DHK3
  comp100004_c0 F4J3J5
  comp100005_c0 P54150

and I need it to look like this:

  comp100002_c0 Q9FFI3
  comp100004_c0 B9DHK3|F4J3J5   
  comp100005_c0 P54150

I tried a Python script, but it doesn't work, and after tweaking the code for a while I am quite stuck. Does anyone have code that produces this output? Thanks

My attempt in Python:

  f1 = open(sys.argv[1], 'rU')
  lines = f1.readlines()
  for i in range(0, len(lines)):
      line = lines[i]
      next_l = f1.next()
      splitline = line.split('\t')
      splitnext = next_l.split('\t')
      if splitline[0] == splitnext[0]:
              print splitline[0] + '\t' + splitline[1] + '|' + splitnext[1]
      else:
          print line
go enrichment cluego uniprot gene2accession • 1.8k views
7.8 years ago
EagleEye 7.5k

I guess this post doesn't really fit this forum, but I hope this solution serves your purpose.

Group by the first column using AWK:

awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "| " } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' FILE_INPUT_TAB_SEPARATED

OUTPUT

comp100002_c0   Q9FFI3|
comp100004_c0   B9DHK3| F4J3J5|
comp100005_c0   P54150|

If you would like to remove the last occurrence of '|' from each output line, use this:

awk 'BEGIN{FS="\t"}{ if( !seen[$1]++ ) order[++oidx] = $1; stuff[$1] = stuff[$1] $2 "| " } END { for( i = 1; i <= oidx; i++ ) print order[i]"\t"stuff[order[i]] }' FILE_INPUT_TAB_SEPARATED | sed 's/\(.*\)|/\1/'

comp100002_c0   Q9FFI3 
comp100004_c0   B9DHK3| F4J3J5 
comp100005_c0   P54150
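
For anyone who would rather stay in Python, as in the question, the same grouping can be sketched roughly as below. This is only a minimal sketch, not a fix of the original script: it assumes the input is tab-separated with the gene ID in the first column and the accession in the second, and the function name `group_accessions` is made up for illustration. It joins accessions with a plain '|' (no space), so no trailing separator needs to be stripped afterwards.

```python
import sys
from collections import OrderedDict


def group_accessions(lines):
    """Collect accessions per gene ID, keeping gene IDs in first-seen order.

    Assumes each non-blank line holds at least two tab-separated fields:
    gene ID, then accession. (Hypothetical helper, not from the original post.)
    """
    groups = OrderedDict()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        gene_id, accession = line.split("\t")[:2]
        groups.setdefault(gene_id.strip(), []).append(accession.strip())
    # One output row per gene: ID, a tab, then accessions joined by '|'
    return ["{}\t{}".format(gene, "|".join(accs)) for gene, accs in groups.items()]


if __name__ == "__main__":
    # Only runs when exactly one input file path is given on the command line
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as handle:
            for row in group_accessions(handle):
                print(row)
```

If this were saved under some name of your choosing, say `group_accessions.py`, it could be run as `python group_accessions.py FILE_INPUT_TAB_SEPARATED > gene2accession.txt`. Unlike the attempt in the question, it does not compare each line with the next one, so a gene with three or more accessions, or with its lines not adjacent in the file, should still end up on a single row.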


Thanks a million @EagleEye. I asked on Stack Overflow and my question was blocked because it was 'unclear' what I was asking for. Note that to get rid of the last occurrence of '|' I used sed 's/\(.*\)|/\1/' (the parentheses need backslashes in front of them). Thanks again
