Question: Extract and retain only chromosome info in GWAS sumstats file
camerond150 wrote:

I have a GWAS sumstats file called sumstats.txt in the following format:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10:101574552:A:ATG      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10:10222597:AT:A        0       10      10222597        a       at      0.9997  0.01    0.9777

For every entry in the sumstats file I need to munge column 1 to retain only the chromosome, i.e. delete the first colon and everything after it in that column, whilst leaving the rest of the file intact.

The final file would look like this:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

My awk/sed knowledge is not great but I've had a bash (npi!!):

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

Giving:

10   100968448       100968449       T/AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812

But I'm not sure how to get rid of the info after the colon and pipe the file in correctly. I also tried:

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'

Returning:

10

I'm looking for a bash/awk/sed one liner for this as I have quite a few sumstats files to process.

Any suggestions would be greatly appreciated.

awk munging gwas sumstats sed bash • 241 views
shawn.w.foley1.2k wrote:

It looks like your goal is to split the first field on the colon, then print the rest of the line unchanged. Your awk script is splitting the whole line ($0), not the first field ($1). You can try:

awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' inFile.txt > outFile.txt

This will yield:

SNP Freq.A1 CHR BP  A1  A2  OR  SE  P
10  0.3519  10  100968448   t   aa  1.0024  0.01    0.812
10  0.4493  10  101574552   a   atg 0.98906 0.0097  0.2585
10  0   10  10222597    a   at  0.9997  0.01    0.9777

The awk script is doing a few things:

BEGIN{OFS="\t";FS="\t"} sets both the Output Field Separator and the input Field Separator to tabs, so the file is assumed to be tab-delimited.

split($1,ARR,":") splits just the first field on the colon, and stores each element in the array ARR.

$1 = ARR[1] redefines the first field as the first array element (just the chromosome in this case).

print $0 prints the entire line (with the newly defined first field).
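
Since the question mentions quite a few sumstats files, the same command could be wrapped in a small shell function and run in a loop. This is only a sketch: the function name chr_only, the *.sumstats.txt glob and the .chr.txt output suffix are made up for illustration, and the input is assumed to be tab-delimited.

```shell
# Hypothetical helper: keep only the text before the first colon in column 1.
# Assumes tab-delimited input; reads file arguments, or stdin if none are given.
chr_only() {
    awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' "$@"
}

# Apply it to every matching file (the glob and output suffix are assumptions).
for f in *.sumstats.txt; do
    [ -e "$f" ] || continue              # skip if the glob matched nothing
    chr_only "$f" > "${f%.txt}.chr.txt"  # e.g. x.sumstats.txt -> x.sumstats.chr.txt
done
```

Writing to a new file rather than overwriting the input keeps the originals available in case the result needs checking.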


Many Thanks @shawn.w.foley

Your explanation will help me on future munging attempts.

(reply written 12 months ago by camerond150)
Pierre Lindenbaum128k wrote:

Something like this?

sed 's/^\([^:]*\):[^[:space:]]*/\1/'
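
A sed expression like this can be tried on a single line before running it on whole files. The sketch below is a variant rather than a definitive version: the function name strip_after_colon is made up, and it uses the POSIX class [^[:space:]] so that it does not depend on how a given sed treats \t inside a bracket expression. The replacement is just \1, because the whitespace that follows the first field is not part of the match and so stays in the line.

```shell
# Hypothetical helper: delete everything from the first colon to the end of
# the first whitespace-delimited field. Reads file arguments or stdin.
strip_after_colon() {
    sed 's/^\([^:]*\):[^[:space:]]*/\1/' "$@"
}

# Try it on one tab-delimited line before touching real files.
printf '10:100968448:T:AA\t0.3519\t10\n' | strip_after_colon
```

Lines without a colon, such as the header, do not match the pattern and pass through unchanged.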