Question

Make a .gmt file using bash for loop

2

Entering edit mode

9.4 years ago

jon.brate ▴ 310

I want to create a .gmt file for visualizing Gene enrichment sets in the EnrichmentMap plugin for cytoscape

My input file is like this:

scigt000016    GO:0005515
scigt000021    GO:0005515
scigt000027    GO:0044464
scigt000010    GO:0005515
scigt000011    GO:0015074

And I want to convert it like this:

GO:0005515 NA scigt000016 scigt000021 scigt000010
GO:0044464 NA scigt000027
GO:0015074 NA scigt000011

So basically putting the GO-term in column 1, some random text in col2 and the genes from col1 in the input file on a line after the go-term.

I was thinking to use a for loop that greps the go-term from column 2 in the input file, then for each line append col1 to the end of the line. But really I am quite lost here.

bash gmt • 4.9k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by jon.brate ▴ 310

1

Entering edit mode

9.4 years ago

Zhilong Jia ★ 2.2k

imagine aa.t is the input file and bb.t is the output

awk 'BEGIN{FS=OFS="\t"}{data[$1] = data[$1] "\t" $2}END{for (i in data) print i, "NA", data[i]}' aa.t | sed s/\\t\\t/\\t/ - > bb.t

EDIT: the sep should be \t in the input file.

ADD COMMENT • link 9.4 years ago by Zhilong Jia ★ 2.2k

Ram · Accepted Answer · 2014-11-28

6

Entering edit mode

9.4 years ago

Pierre Lindenbaum 161k

sort on 2nd column, use awk to retrieve the 'previous' value of the 2nd column. If it isn't the same, start a new line.

sort -k2,2 input.txt  | awk '{if($2!=prev) {printf("%s%s\tNA\t%s",(NR==1?"":"\n"),$2,$1);prev=$2;next;} else printf(" %s",$1);} END{printf("\n");}'

GO:0005515    NA     scigt000010 scigt000016 scigt000021
GO:0015074    NA     scigt000011
GO:0044464    NA     scigt000027

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Beautiful solution, thanks!

ADD REPLY • link 9.4 years ago by jon.brate ▴ 310

0

Entering edit mode

this command is great but in long line give truncate output damage the format.

ADD REPLY • link 5.8 years ago by Mostafa Mahmoud • 0

0

Entering edit mode

What is happening with signposting the colon in this one liner? As when I use this code with gene-set names which contain a colon, it works beautifully. However (to avoid downstream hiccups in R), I tend to strip non-alphanumeric characters from my variables. If I run the code with those gene-set names, the "NA" appears somewhere in the middle of the gene-set names. So what is awk doing with (NR==1?"":"\n") and is there a way of producing the same result with gene-set names that lack the colon, but same format conversion of gene_name "\t" gene_set -> gene_set "\t" "NA" "\t" gene_name ?

Thanks!

ADD REPLY • link 5.3 years ago by lynsey.hall • 0