AWK Join two files does not work
6.2 years ago
pablosolar.r ▴ 20

Hi all,

I have these two tab-separated .dat files (only about the first 20 lines of each are shown):

GO:0005509  PDCD6
GO:0004672  CDK1
GO:0005524  CDK1
GO:0005634  CDK1
GO:0005737  CDK1
GO:0006468  CDK1
GO:0005615  SERPINB6
GO:0006629  APOC2
GO:0006869  APOC2
GO:0008047  APOC2
GO:0042627  APOC2
GO:0043085  APOC2
GO:0001932  TADA2L
GO:0003677  TADA2L
GO:0005671  TADA2L
GO:0006357  TADA2L
GO:0007067  TADA2L
GO:0008270  TADA2L
GO:0016573  TADA2L

And

GO:0000001  mitochondrion inheritance
GO:0000002  mitochondrial genome maintenance
GO:0000003  reproduction
GO:0000005  ribosomal chaperone activity
GO:0000006  high affinity zinc uptake transmembrane transporter activity
GO:0000007  low-affinity zinc ion transmembrane transporter activity
GO:0000008  thioredoxin
GO:0000009  alpha-1,6-mannosyltransferase activity
GO:0000010  trans-hexaprenyltranstransferase activity
GO:0000011  vacuole inheritance
GO:0000012  single strand break repair
GO:0000014  single-stranded DNA specific endodeoxyribonuclease activity
GO:0000015  phosphopyruvate hydratase complex
GO:0000016  lactase activity
GO:0000017  alpha-glucoside transport
GO:0000018  regulation of DNA recombination
GO:0000019  regulation of mitotic recombination
GO:0000020  negative regulation of recombination within rDNA repeats
(...)

When I try to join the two files on the GO ID, I get only a few results (exactly 10). The complete code is:

ls *gene_association* | while read file;
do
echo;
echo @@@ File: $file;
echo;

# New file "assoc_specie.txt"
IFS='_' read -r -a array <<< "$file"
SPECIE=${array[2]}

#Filtering comments (!comment...)
cat $file | grep -v '!' > assoc_$SPECIE.txt;
gawk 'BEGIN{OFS="\t";FS="\t"}{print $5, $3}' assoc_$SPECIE.txt > goTerms_$SPECIE.dat;
join goTerms_$SPECIE.dat gene_ontology.dat > join.dat

echo
done;

I don't know what I am doing wrong, but join is clearly not returning all the matching lines.

Thanks in advance

PS: The assoc_specie.txt file has this format (only the first line is shown):

UniProtKB   A0A024QZ42  PDCD6       GO:0005509  GO_REF:0000002  IEA InterPro:IPR002048  F   HCG1985580, isoform CRA_c   A0A024QZ42_HUMAN|PDCD6|hCG_1985580  protein taxon:9606  20160312    InterPro
(...)
6.2 years ago
pablosolar.r ▴ 20

Hi all,

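I just found the mistake: I forgot to sort the file. join expects both inputs to be sorted on the join field, so unsorted input silently drops matches: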

(...)
    gawk 'BEGIN{OFS="\t";FS="\t"}{print $5, $3}' assoc_$SPECIE.txt | sort | uniq > goTerms_$SPECIE.dat;
(...)
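
For anyone hitting the same problem, here is a minimal sketch of the sorted join on a few sample rows. The temporary file names and the three sample rows are made up for illustration; they only stand in for goTerms_$SPECIE.dat and gene_ontology.dat. It assumes both files are tab-separated, so join is given the tab as its field separator, and it fixes LC_ALL=C so that sort and join agree on the ordering.

```shell
#!/bin/sh
# Sketch only: sort both inputs on the join field, then join on it.
set -eu
LC_ALL=C; export LC_ALL            # same byte-wise collation for sort and join
tab=$(printf '\t')                 # literal tab, used as the field separator

tmp=$(mktemp -d)                   # scratch directory with hypothetical file names
trap 'rm -rf "$tmp"' EXIT

# Illustrative stand-ins for goTerms_$SPECIE.dat and gene_ontology.dat
printf 'GO:0005509\tPDCD6\nGO:0004672\tCDK1\nGO:0005524\tCDK1\nGO:0005524\tCDK1\n' > "$tmp/goTerms.dat"
printf 'GO:0005524\tATP binding\nGO:0004672\tprotein kinase activity\nGO:0005509\tcalcium ion binding\n' > "$tmp/ontology.dat"

# Sort on the first field; uniq then drops fully duplicated lines
sort -t "$tab" -k1,1 "$tmp/goTerms.dat" | uniq > "$tmp/goTerms.sorted"
sort -t "$tab" -k1,1 "$tmp/ontology.dat" > "$tmp/ontology.sorted"

# Join on field 1 of both files, keeping tabs as separators
joined=$(join -t "$tab" "$tmp/goTerms.sorted" "$tmp/ontology.sorted")
printf '%s\n' "$joined"
```

On these sample rows the output has three lines, one per GO ID, each with the gene symbol followed by the term name. Without the -t option join splits on blanks, so a multi-word term name is treated as several fields rather than one.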
6.2 years ago
lh3 33k

I would do this:

awk 'BEGIN{FS=OFS="\t";while((getline<"file1.txt")>0)l[$1]=$2}l[$1]{print $1,l[$1],$2}' file2.txt
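
A variant of the same idea, as a rough sketch rather than a drop-in replacement: load the file whose GO IDs are unique (the ontology) into an array, then stream the gene file and look each ID up. No sorting is needed. The file names, the sample rows, and the GO:9999999 ID below are made up for illustration. The sketch uses the NR==FNR idiom instead of getline and tests membership with "in", so a term whose name is empty or "0" is not skipped.

```shell
#!/bin/sh
# Sketch only: hash join in awk, no sorting required.
set -eu

tmp=$(mktemp -d)                   # scratch directory with hypothetical file names
trap 'rm -rf "$tmp"' EXIT

# Illustrative inputs; GO:9999999 is a made-up ID with no match in the ontology
printf 'GO:0005509\tPDCD6\nGO:0004672\tCDK1\nGO:0005524\tCDK1\nGO:9999999\tNOHIT\n' > "$tmp/goTerms.dat"
printf 'GO:0004672\tprotein kinase activity\nGO:0005509\tcalcium ion binding\nGO:0005524\tATP binding\n' > "$tmp/ontology.dat"

hashed=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR    { name[$1] = $2; next }       # first file: GO ID -> term name
  ($1 in name) { print $1, $2, name[$1] }    # second file: keep rows with a known GO ID
' "$tmp/ontology.dat" "$tmp/goTerms.dat")
printf '%s\n' "$hashed"
```

The rows come out in the order of the gene file, and the unmatched GO:9999999 line is dropped. Loading the gene file into the array instead would keep only one gene per GO ID, because later lines overwrite earlier ones with the same key.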