Question: How to write a script to split GO terms, one per line
Alexie Li wrote, 8 weeks ago:

Hi all,

I have a file in which the first column contains an ID and the second column contains the annotated Gene Ontology (GO) terms, as follows:

CPIW_00004002-RA GO:0005515

CPIW_00004002-RA GO:0010997|GO:0097027|GO:1904668

CPIW_00004003-RA GO:0003824|GO:0008152

CPIW_00004003-RA GO:0003987|GO:0016208|GO:0019427

CPIW_00004004-RA GO:0006506|GO:0016021|GO:0016758

CPIW_00004005-RA GO:0004360|GO:1901137

CPIW_00004005-RA GO:0097367|GO:1901135

CPIW_00004006-RA GO:0005515

CPIW_00004007-RA GO:0016787

CPIW_00004016-RA GO:0003824|GO:0046872

I want to split them so that each line has one ID and one GO term, like this:

CPIW_00004002-RA GO:0005515

CPIW_00004002-RA GO:0010997

CPIW_00004002-RA GO:0097027

CPIW_00004002-RA GO:1904668

CPIW_00004003-RA GO:0003824

CPIW_00004003-RA GO:0008152

How can I write a script to do this?

Thanks!

Alexie


This is a programming question, not a bioinformatics one. Ask on StackOverflow.

Comment written 8 weeks ago by Jean-Karim Heriche
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 8 weeks ago:
 awk '{n=split($2,a,/\|/); for(i=1;i<=n;++i) print $1,a[i];}' input.txt
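
For readers less familiar with awk: the one-liner above splits the second field on `|` and prints the first field once for each piece. Below is a rough Python sketch of the same logic, assuming the two columns are separated by whitespace. The function name `split_go_line` is only illustrative and does not come from the original answer.

```python
def split_go_line(line):
    """Return (id, GO term) pairs for one input line.

    Assumes at least two whitespace-separated fields, with the second
    field holding GO terms joined by '|'.
    """
    fields = line.split()
    if len(fields) < 2:
        return []  # blank or malformed line: nothing to emit
    gene_id, go_field = fields[0], fields[1]
    # One pair per GO term, mirroring awk's split() followed by the for loop
    return [(gene_id, go) for go in go_field.split("|")]


if __name__ == "__main__":
    sample = "CPIW_00004002-RA GO:0010997|GO:0097027|GO:1904668"
    for gene_id, go in split_go_line(sample):
        print(gene_id, go)
```

Returning pairs instead of printing directly keeps the splitting step separate from the output format, so the delimiter can be chosen when printing.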
st.ph.n (Philadelphia, PA) wrote, 8 weeks ago:
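#!/usr/bin/env python3
import sys

# Usage: python3 split_go.py input.txt
with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()  # split on any whitespace (tabs or spaces)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        gene_id, go_terms = fields[0], fields[1]
        for go in go_terms.split('|'):
            print(gene_id, go, sep='\t')  # one ID and one GO term per line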
Chirag Parsania (University of Macau) wrote, 8 weeks ago:

Using R

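library("psych")  ## provides read.clipboard()
## read the two-column data from the clipboard
dat <- read.clipboard(header = FALSE)
dat <- apply(dat, 2, as.character)
## for each row, pair the gene ID with each of its GO terms
out <- apply(dat, 1, function(elem) {
        geneId <- elem[1]
        goIds <- elem[2]
        splitted <- unlist(strsplit(goIds, "\\|"))
        return(cbind(geneID = rep(geneId, length(splitted)), splitted))
})

## stack the per-row matrices into one two-column matrix
do.call("rbind", out)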

EDIT 2: An alternative that takes only a few lines

library("psych") ## read data from clipboard 
dat <- read.clipboard(header = F)
out <- lapply(as.character(dat$V2),function(elem){unlist(strsplit(elem,"\\|"))})
cbind(rep(as.character(dat$V1),lengths(out)), unlist(out))
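
The second R version works column-wise: it splits every GO field first, then repeats each ID as many times as its field has terms. A minimal Python sketch of that same idea, for comparison, might look like the following. The function name `expand_columns` and its list-based inputs are assumptions made for illustration.

```python
def expand_columns(ids, go_fields):
    """Pair each ID with every GO term in its '|'-joined field.

    `ids` and `go_fields` are parallel lists (column 1 and column 2).
    """
    # Split every GO field up front, like lapply(..., strsplit) in R
    split_terms = [field.split("|") for field in go_fields]
    # Repeat each ID once per term, like rep(ids, lengths(out))
    repeated_ids = [
        gene_id
        for gene_id, terms in zip(ids, split_terms)
        for _ in terms
    ]
    # Flatten the nested term lists, like unlist(out)
    flat_terms = [term for terms in split_terms for term in terms]
    return list(zip(repeated_ids, flat_terms))


if __name__ == "__main__":
    ids = ["CPIW_00004002-RA", "CPIW_00004003-RA"]
    go_fields = ["GO:0005515", "GO:0003824|GO:0008152"]
    for gene_id, term in expand_columns(ids, go_fields):
        print(gene_id, term, sep="\t")
```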