Question: How to write a script to split GO terms, one per line
Alexie Li wrote (4 months ago):

Hi all,

I have a file in which the first column contains an ID and the second column contains the annotated Gene Ontology (GO) terms, separated by "|", as follows:

CPIW_00004002-RA GO:0005515

CPIW_00004002-RA GO:0010997|GO:0097027|GO:1904668

CPIW_00004003-RA GO:0003824|GO:0008152

CPIW_00004003-RA GO:0003987|GO:0016208|GO:0019427

CPIW_00004004-RA GO:0006506|GO:0016021|GO:0016758

CPIW_00004005-RA GO:0004360|GO:1901137

CPIW_00004005-RA GO:0097367|GO:1901135

CPIW_00004006-RA GO:0005515

CPIW_00004007-RA GO:0016787

CPIW_00004016-RA GO:0003824|GO:0046872

I want to split them so that each line has one ID and one GO term, like this:

CPIW_00004002-RA GO:0005515

CPIW_00004002-RA GO:0010997

CPIW_00004002-RA GO:0097027

CPIW_00004002-RA GO:1904668

CPIW_00004003-RA GO:0003824

CPIW_00004003-RA GO:0008152

How can I write a script to do this?

Thanks!

Alexie

linux file format python perl • 258 views

This is a programming question, not a bioinformatics one. Ask on StackOverflow.

written 4 months ago by Jean-Karim Heriche
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (4 months ago):
 awk '{n=split($2,a,/\|/); for(i=1;i<=n;++i) print $1,a[i];}' input.txt
st.ph.n (Philadelphia, PA) wrote (4 months ago):
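#!/usr/bin/env python3
import sys

with open(sys.argv[1]) as f:
    for line in f:
        fields = line.split()  # split on tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        gene_id, go_terms = fields[0], fields[1]
        # print one tab-separated "ID<TAB>GO term" line per GO term
        for term in go_terms.split('|'):
            print(gene_id, term, sep='\t')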
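The same splitting logic can also be wrapped in a small function, which makes it easy to check against the sample lines from the question. This is only a sketch and not part of the original answer: the name `split_go_terms` is hypothetical, and it assumes the ID and the GO list are separated by whitespace (tab or spaces).

```python
def split_go_terms(lines):
    """Yield (gene_id, go_term) pairs, one pair per GO term.

    Assumes each non-blank line holds an ID followed by a
    '|'-separated list of GO terms, separated by whitespace.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        gene_id, go_terms = fields[0], fields[1]
        for term in go_terms.split("|"):
            yield gene_id, term


# Two lines taken from the question, with a blank line in between
sample = [
    "CPIW_00004002-RA GO:0005515",
    "",
    "CPIW_00004002-RA GO:0010997|GO:0097027|GO:1904668",
]
for gene_id, term in split_go_terms(sample):
    print(gene_id, term, sep="\t")
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the `sample` list.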
Chirag Parsania (University of Macau) wrote (4 months ago):

Using R

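library("psych")  # provides read.clipboard()
dat <- read.clipboard(header = FALSE)  # read the two-column table from the clipboard
dat <- apply(dat, 2, as.character)
# Build one two-column matrix per input row. lapply always returns a list,
# even when every row has the same number of GO terms.
out <- lapply(seq_len(nrow(dat)), function(i) {
        geneId <- dat[i, 1]
        goIds <- dat[i, 2]
        splitted <- unlist(strsplit(goIds, "\\|"))
        cbind(geneID = rep(geneId, length(splitted)), splitted)
})

# Stack the per-row matrices into one table
do.call("rbind", out)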

EDIT: A second alternative, in only a few lines

library("psych")  # provides read.clipboard()
dat <- read.clipboard(header = FALSE)
out <- lapply(as.character(dat$V2), function(elem) unlist(strsplit(elem, "\\|")))
cbind(rep(as.character(dat$V1), lengths(out)), unlist(out))