Extract and retain only chromosome info in GWAS sumstats file
3.0 years ago
camerond ▴ 170

I have a GWAS sumstats file called sumstats.txt in the following format:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10:101574552:A:ATG      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10:10222597:AT:A        0       10      10222597        a       at      0.9997  0.01    0.9777


For every entry in the sumstats file I need to munge column 1 to retain only the chromosomal information, effectively deleting everything after the first colon, whilst leaving the rest of the file intact.

The final file would look like this:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777


My awk/sed knowledge is not great but I've had a bash (npi!!):

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

Giving:

10 100968448 100968449 T/AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812

But I'm not sure how to get rid of the info after the colon and pipe the file in correctly. I also tried:

echo "10:100968448:T:AA 0.3519 10 100968448 t aa 1.0024 0.01 0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'


Returning:

10


I'm looking for a bash/awk/sed one liner for this as I have quite a few sumstats files to process.

Any suggestions would be greatly appreciated.

bash GWAS sumstats awk sed munging • 559 views
3.0 years ago
shawn.w.foley ★ 1.3k

It looks like your goal is to split the first field on the colon, then print the rest of the file. Your awk script is splitting the whole line, not just the first field. You can try:

awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":");$1 = ARR[1]; print $0}' inFile.txt > outFile.txt

This will yield:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

The awk script is doing a few things:

BEGIN{OFS="\t";FS="\t"} defines the Output Field Separator and the Field Separator as tabs.

split($1,ARR,":") splits just the first field on the colon and stores each element in the array ARR.

$1 = ARR[1] redefines the first field as the first array element (just the chromosome in this case).

print $0 prints the entire line (with the newly defined first field).
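Since you mentioned having quite a few sumstats files, the same command can be wrapped in a shell loop. This is only a sketch: the demo_sumstats directory, the *_sumstats.txt naming pattern and the .chr.txt output suffix are assumptions made up for illustration, so adjust them to match your own files.

```shell
#!/usr/bin/env bash
# Sketch: run the awk command over several sumstats files.
# Assumed (hypothetical) layout: inputs named *_sumstats.txt in demo_sumstats/.
set -euo pipefail

mkdir -p demo_sumstats

# Two tiny tab-delimited demo files, using values from the question.
printf 'SNP\tFreq.A1\tCHR\n10:100968448:T:AA\t0.3519\t10\n' > demo_sumstats/a_sumstats.txt
printf 'SNP\tFreq.A1\tCHR\n10:10222597:AT:A\t0\t10\n' > demo_sumstats/b_sumstats.txt

for f in demo_sumstats/*_sumstats.txt; do
    # Strip the .txt suffix and append .chr.txt so the input is never overwritten.
    out="${f%.txt}.chr.txt"
    awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' "$f" > "$out"
done
```

Writing to a new file for each input keeps the originals intact, and because the output names no longer match the *_sumstats.txt pattern, rerunning the loop will not reprocess its own output.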


Many thanks, @shawn.w.foley.

Your explanation will help me with future munging attempts.

3.0 years ago

Something like this?

sed 's/^\([^:]*\):[^ \t]*/\1/'
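One caveat: treating \t as a tab inside a bracket expression is a GNU sed extension, so other sed implementations may not read it that way. A possibly more portable sketch of the same idea uses the POSIX [[:space:]] class instead. It assumes a whitespace-delimited file in which only the first column contains colons; the demo file names are hypothetical.

```shell
# Demo input: header plus one row, tab-delimited (values from the question).
printf 'SNP\tFreq.A1\tCHR\n10:100968448:T:AA\t0.3519\t10\n' > demo_sed_input.txt

# \([^:]*\) captures everything before the first colon;
# :[^[:space:]]* then consumes the rest of the first field up to the next whitespace;
# \1 writes back only the captured part, leaving the original separator in place.
sed 's/^\([^:]*\):[^[:space:]]*/\1/' demo_sed_input.txt > demo_sed_output.txt

cat demo_sed_output.txt
```

The header line contains no colon, so the pattern does not match and the line passes through unchanged.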