Question

AWK: joining two files does not work

1

8.1 years ago

pablosolar.r ▴ 20

Hi all,

I have these two tab-separated .dat files (only the first lines of each are shown):

GO:0005509  PDCD6
GO:0004672  CDK1
GO:0005524  CDK1
GO:0005634  CDK1
GO:0005737  CDK1
GO:0006468  CDK1
GO:0005615  SERPINB6
GO:0006629  APOC2
GO:0006869  APOC2
GO:0008047  APOC2
GO:0042627  APOC2
GO:0043085  APOC2
GO:0001932  TADA2L
GO:0003677  TADA2L
GO:0005671  TADA2L
GO:0006357  TADA2L
GO:0007067  TADA2L
GO:0008270  TADA2L
GO:0016573  TADA2L

And

GO:0000001  mitochondrion inheritance
GO:0000002  mitochondrial genome maintenance
GO:0000003  reproduction
GO:0000005  ribosomal chaperone activity
GO:0000006  high affinity zinc uptake transmembrane transporter activity
GO:0000007  low-affinity zinc ion transmembrane transporter activity
GO:0000008  thioredoxin
GO:0000009  alpha-1,6-mannosyltransferase activity
GO:0000010  trans-hexaprenyltranstransferase activity
GO:0000011  vacuole inheritance
GO:0000012  single strand break repair
GO:0000014  single-stranded DNA specific endodeoxyribonuclease activity
GO:0000015  phosphopyruvate hydratase complex
GO:0000016  lactase activity
GO:0000017  alpha-glucoside transport
GO:0000018  regulation of DNA recombination
GO:0000019  regulation of mitotic recombination
GO:0000020  negative regulation of recombination within rDNA repeats
(...)

When I join the two files, I get only a few results (exactly 10). The complete code is:

ls *gene_association* | while read file;
do
echo;
echo @@@ File: $file;
echo;

# New file "assoc_specie.txt"
IFS='_' read -r -a array <<< "$file"
SPECIE=${array[2]}

#Filtering comments (!comment...)
cat $file | grep -v '!' > assoc_$SPECIE.txt;
gawk 'BEGIN{OFS="\t";FS="\t"}{print $5, $3}' assoc_$SPECIE.txt > goTerms_$SPECIE.dat;
join goTerms_$SPECIE.dat gene_ontology.dat > join.dat

echo
done;

I don't know what I am doing wrong, but it's obvious that join is not showing all the results.

Thanks in advance

PS: The assoc_specie.txt file has this format (only the first line is shown):

UniProtKB   A0A024QZ42  PDCD6       GO:0005509  GO_REF:0000002  IEA InterPro:IPR002048  F   HCG1985580, isoform CRA_c   A0A024QZ42_HUMAN|PDCD6|hCG_1985580  protein taxon:9606  20160312    InterPro
(...)

2

8.1 years ago

lh3 33k

I would do this:

awk 'BEGIN{FS=OFS="\t";while((getline<"file1.txt")>0)l[$1]=$2}l[$1]{print $1,l[$1],$2}' file2.txt
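
To make the idea behind the one-liner easier to follow, here is a minimal sketch of the same hash join on a few rows. It assumes the first file is the ontology table (one line per GO ID) and the second is the GO-to-gene table. The file names terms.tsv and genes.tsv, the gene names GENE_A/GENE_B/GENE_C, the ID GO:0009999 and the temporary directory are hypothetical. It also swaps in two small variations: the lookup file is read with the common NR == FNR idiom instead of getline, and the membership test is `$1 in desc`, which does not create empty array entries.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; file names and gene names are illustrative only.
tmp=$(mktemp -d)
printf 'GO:0000003\treproduction\nGO:0000016\tlactase activity\n' > "$tmp/terms.tsv"
printf 'GO:0000016\tGENE_A\nGO:0000003\tGENE_B\nGO:0009999\tGENE_C\n' > "$tmp/genes.tsv"

# Load the first file into an array keyed by GO ID, then look up
# every row of the second file. No sorting is required.
joined=$(awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR    { desc[$1] = $2; next }     # first file: GO ID -> description
    ($1 in desc) { print $1, desc[$1], $2 }  # second file: keep rows with a known GO ID
' "$tmp/terms.tsv" "$tmp/genes.tsv")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Because the array keeps only one description per key, the file with unique GO IDs should be the one loaded first; rows of the second file keep their original order.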

score 2 · Accepted Answer · 2016-04-05

2

8.1 years ago

pablosolar.r ▴ 20

Hi all,

I just found the mistake. join expects both input files to be sorted on the join field, and I forgot to sort the goTerms file:

(...)
    gawk 'BEGIN{OFS="\t";FS="\t"}{print $5, $3}' assoc_$SPECIE.txt | sort | uniq > goTerms_$SPECIE.dat;
(...)
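
A minimal sketch of the sorted join on toy data, for anyone hitting the same problem. It assumes bash and tab-separated files; the sample rows and gene names are hypothetical. Two details are additions not found in the code above: LC_ALL=C, so that sort and join agree on the ordering, and `-t $'\t'`, so that join splits on tabs only and descriptions containing spaces stay in one field.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: an unsorted GO-to-gene file with one duplicate row.
tmp=$(mktemp -d)
printf 'GO:0000016\tGENE_A\nGO:0000003\tGENE_B\nGO:0000016\tGENE_A\n' > "$tmp/goTerms.dat"
printf 'GO:0000003\treproduction\nGO:0000016\tlactase activity\n' > "$tmp/gene_ontology.dat"

# Use one collation order for both sort and join.
export LC_ALL=C

# sort -u does the same job as "sort | uniq": sort and drop duplicate lines.
sort -u "$tmp/goTerms.dat" > "$tmp/goTerms.sorted.dat"
sort -t $'\t' -k1,1 "$tmp/gene_ontology.dat" > "$tmp/gene_ontology.sorted.dat"

# Join on the first field of both files, with tab as the field separator.
joined=$(join -t $'\t' "$tmp/goTerms.sorted.dat" "$tmp/gene_ontology.sorted.dat")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Each output line holds the GO ID, the gene, and the description, separated by tabs.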
