Question: Make a .gmt file using bash for loop
gravatar for jon.brate
5.2 years ago by
jon.brate250 wrote:

I want to create a .gmt file for visualizing Gene enrichment sets in the EnrichmentMap plugin for cytoscape

My input file is like this:

scigt000016    GO:0005515

scigt000021    GO:0005515

scigt000027    GO:0044464

scigt000010    GO:0005515

scigt000011    GO:0015074


And I want to convert it like this:

GO:0005515 NA scigt000016 scigt000021 scigt000010

GO:0044464 NA scigt000027

GO:0015074 NA scigt000011


So basically putting the GO-term in column 1, some random text in col2 and the genes from col1 in the input file on a line after the go-term.

I was thinking to use a for loop that greps the go-term from column 2 in the input file, then for each line append col1 to the end of the line. But really I am quite lost here.



bash for gmt loop • 2.4k views
ADD COMMENTlink modified 5.2 years ago by Zhilong Jia1.5k • written 5.2 years ago by jon.brate250
gravatar for Pierre Lindenbaum
5.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum125k wrote:

sort on 2nd column, use awk to retrieve the 'previous' value of the 2nd column. If it isn't the same, start a new line.


sort -k2,2 input.txt  | awk '{if($2!=prev) {printf("%s%s\tNA\t%s",(NR==1?"":"\n"),$2,$1);prev=$2;next;} else printf(" %s",$1);} END{printf("\n");}'

GO:0005515    NA     scigt000010 scigt000016 scigt000021
GO:0015074    NA     scigt000011
GO:0044464    NA     scigt000027


ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Pierre Lindenbaum125k

Beautiful solution, thanks!

ADD REPLYlink written 5.2 years ago by jon.brate250

this command is great but in long line give truncate output damage the format.

ADD REPLYlink modified 19 months ago • written 19 months ago by Mostafa Mahmoud0

What is happening with signposting the colon in this one liner? As when I use this code with gene-set names which contain a colon, it works beautifully. However (to avoid downstream hiccups in R), I tend to strip non-alphanumeric characters from my variables. If I run the code with those gene-set names, the "NA" appears somewhere in the middle of the gene-set names. So what is awk doing with (NR==1?"":"\n") and is there a way of producing the same result with gene-set names that lack the colon, but same format conversion of gene_name "\t" gene_set -> gene_set "\t" "NA" "\t" gene_name ?


ADD REPLYlink written 12 months ago by lynsey.hall0
gravatar for Zhilong Jia
5.2 years ago by
Zhilong Jia1.5k
Zhilong Jia1.5k wrote:

imagine aa.t is the input file and bb.t is the output

awk 'BEGIN{FS=OFS="\t"}{data[$1] = data[$1] "\t" $2}END{for (i in data) print i, "NA", data[i]}' aa.t | sed s/\\t\\t/\\t/ - > bb.t

EDIT: the sep should be \t in the input file.

ADD COMMENTlink modified 5.2 years ago • written 5.2 years ago by Zhilong Jia1.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1924 users visited in the last hour