I have phased output from Beagle, and I want to compare the columns of the file against each other and return the number of matching elements for each pair. I would prefer to use shell scripting or awk. Here is the bash/awk script that I am trying to use:
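#!/bin/bash
# Count the rows where column i equals column j, for every pair of columns 3-9.
for i in 3 4 5 6 7 8 9
do
  for j in 3 4 5 6 7 8 9
  do
    # Pass the column numbers in with -v. Inside double quotes the shell expands
    # $i and $j itself, so awk would compare two constants (e.g. 3 == 4).
    awk -v i="$i" -v j="$j" '$i == $j' phased.txt | wc -l
  done
done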
I have a file of size 147189828 and I want to compare every pair of columns and return the number of matched elements as an 828 x 828 matrix (a similarity matrix). This would be fairly easy in MATLAB, but it takes a long time to load huge files. I can compare two columns and count the matching elements with the following awk command: awk '$3==$4' phased.txt | wc -l, but I need some help doing this for the entire file.
A snippet of the data:
# sampleID    HGDP00511  HGDP00511  HGDP00512  HGDP00512  HGDP00513  HGDP00513
M rs4124251   0          0          A          G          0          A
M rs6650104   0          A          C          T          0          0
M rs12184279  0          0          G          A          T          0
..
..
Please always include a snippet of your data. I have no idea what a phased Beagle file is, but I can help you with the comparison.
Hi Sukhdeep,
Thanks for reaching out. I have posted a snippet of the sample data. Your help is much appreciated.
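Going by the snippet, here is a minimal sketch of how the whole matrix could be built in a single pass over the file, instead of running awk once per pair of columns. It is only a sketch under a few assumptions: the data columns start at field 3, fields are whitespace-separated, lines starting with # are headers, and a missing allele (0) matching another 0 counts as a match, exactly as in your awk '$3==$4' test. The function name similarity_matrix is just a name I made up for illustration.

```shell
#!/bin/bash
# Sketch: count matching rows for every pair of columns in one pass.
# similarity_matrix is a hypothetical helper name; the default first data
# column (3) is an assumption based on the posted snippet.
similarity_matrix() {
  local file=$1 first=${2:-3}
  awk -v first="$first" '
    /^#/ { next }                      # skip the "# sampleID" header line
    {
      if (NF > ncol) ncol = NF         # remember the widest row seen
      for (i = first; i <= NF; i++)
        for (j = i; j <= NF; j++)
          if ($i == $j) count[i, j]++  # upper triangle only; the matrix is symmetric
    }
    END {
      for (i = first; i <= ncol; i++) {
        row = ""
        for (j = first; j <= ncol; j++) {
          # read the mirrored cell for the lower triangle
          n = (i <= j) ? count[i, j] : count[j, i]
          row = row (j > first ? "\t" : "") (n + 0)
        }
        print row                      # one tab-separated matrix row per column
      }
    }
  ' "$file"
}
```

After sourcing the function you could run it as: similarity_matrix phased.txt > matrix.tsv. The file is read only once, but every row still needs a comparison for each pair of columns, so with 828 columns it may take a while. If missing genotypes should not count as matches, the test could be changed to something like ($i == $j && $i != "0").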