Extract and retain only chromosome info in GWAS sumstats file
3.0 years ago
camerond ▴ 170

I have a GWAS sumstats file called sumstats.txt in the following format:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10:101574552:A:ATG      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10:10222597:AT:A        0       10      10222597        a       at      0.9997  0.01    0.9777

For every entry in the sumstats file I need to munge column 1 to retain only the chromosomal information, effectively deleting everything after the first colon, whilst leaving the rest of the file intact.

The final file would look like this:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

My awk/sed knowledge is not great but I've had a bash (npi!!):

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

Giving:

10   100968448       100968449       T/AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812

But I'm not sure how to drop the information after the colon, or how to feed the whole file in correctly. I also tried:

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'

Returning:

10

I'm looking for a bash/awk/sed one liner for this as I have quite a few sumstats files to process.

Any suggestions would be greatly appreciated.

bash GWAS sumstats awk sed munging • 559 views
3.0 years ago
shawn.w.foley ★ 1.3k

It looks like your goal is to split the first field on the colon, then print the rest of the line unchanged. Your awk script is splitting the whole line ($0), not just the first field ($1). You can try:

awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' inFile.txt > outFile.txt

This will yield:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

The awk script is doing a few things:

BEGIN{OFS="\t";FS="\t"} is defining the Output Field Separator and Field Separator as tabs.

split($1,ARR,":") splits just the first field on the colon, and stores each element in the array ARR

$1 = ARR[1] redefines the first field as the first array element (just the chromosome in this case).

print $0 prints the entire line (with the newly defined first field).
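
Since the question mentions quite a few sumstats files, the same command could be wrapped in a loop. This is only a minimal sketch, assuming every file is tab-delimited like the example. The function name `strip_snp_id`, the `*sumstats*.txt` glob and the `munged/` output directory are placeholders for illustration, not anything from the original post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the text before the first colon in column 1.
# Assumes tab-delimited input; the header has no colon, so it is left as-is.
strip_snp_id() {
    awk 'BEGIN{FS=OFS="\t"} {split($1, arr, ":"); $1 = arr[1]; print}' "$1"
}

# Write results to a separate directory so that re-running the loop
# never picks up its own output (directory name is an assumption).
mkdir -p munged
for f in *sumstats*.txt; do
    [ -e "$f" ] || continue      # skip if the glob matched nothing
    strip_snp_id "$f" > "munged/$f"
done
```

Writing to a separate directory rather than editing in place keeps the original files intact, which makes it easy to compare before and after.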

Many thanks @shawn.w.foley.

Your explanation will help me with future munging attempts.

3.0 years ago

Something like this?

sed 's/^\([^:]*\):[^[:space:]]*/\1/' sumstats.txt
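
A quick way to check a sed approach like this is to run it on a couple of sample rows. The sketch below uses a variant that leaves the existing delimiter untouched and uses the POSIX class `[[:space:]]` instead of `\t` inside the bracket expression. The function name `first_field_chr` is a placeholder, and the sample is assumed to be tab-delimited.

```shell
# Hypothetical wrapper: reads stdin and trims column 1 to the text before
# the first colon. Lines with no colon at all (e.g. the header) do not match.
first_field_chr() {
    sed 's/^\([^:]*\):[^[:space:]]*/\1/'
}

# Sample rows taken from the question (tab-delimited by assumption).
printf 'SNP\tFreq.A1\tCHR\n10:100968448:T:AA\t0.3519\t10\n' | first_field_chr
```

One caveat: the pattern is anchored only to the start of the line, so a line whose first colon sits in a later column would be altered too. The awk version avoids this by working on the first field explicitly.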