Question: Extract and retain only chromosome info in GWAS sumstats file
8 weeks ago, camerond wrote:

I have a GWAS sumstats file called sumstats.txt in the following format:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10:101574552:A:ATG      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10:10222597:AT:A        0       10      10222597        a       at      0.9997  0.01    0.9777

For every entry in the sumstats file I need to munge column 1 to retain only the chromosomal information, effectively deleting everything after the first colon, whilst leaving the rest of the file intact.

The final file would look like this:

SNP     Freq.A1 CHR     BP      A1      A2      OR      SE      P
10      0.3519  10      100968448       t       aa      1.0024  0.01    0.812
10      0.4493  10      101574552       a       atg     0.98906 0.0097  0.2585
10      0       10      10222597        a       at      0.9997  0.01    0.9777

My awk/sed knowledge is not great but I've had a bash (npi!!):

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1]"\t"a[2]"\t"a[2]+1"\t"a[3]"/"a[4];}'

Giving:

10   100968448       100968449       T/AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812

But not sure how to get rid of the info after the colon and pipe the file in correctly. Also tried:

echo "10:100968448:T:AA       0.3519  10      100968448       t       aa      1.0024  0.01    0.812" | awk '{n=split($0,a,/[:_]/); print a[1];}'

Returning:

10

I'm looking for a bash/awk/sed one liner for this as I have quite a few sumstats files to process.

Any suggestions would be greatly appreciated.

awk munging gwas sumstats sed bash • 102 views
8 weeks ago, shawn.w.foley (USA) wrote:

It looks like your goal is to split the first field on the colon, keep only the first piece, and then print the rest of the line unchanged. Your awk script splits the whole line ($0) rather than just the first field ($1). You can try:

awk 'BEGIN{OFS="\t";FS="\t"} {split($1,ARR,":"); $1 = ARR[1]; print $0}' inFile.txt > outFile.txt

This will yield:

SNP Freq.A1 CHR BP  A1  A2  OR  SE  P
10  0.3519  10  100968448   t   aa  1.0024  0.01    0.812
10  0.4493  10  101574552   a   atg 0.98906 0.0097  0.2585
10  0   10  10222597    a   at  0.9997  0.01    0.9777

The awk script is doing a few things:

BEGIN{OFS="\t";FS="\t"} sets both the input Field Separator (FS) and the Output Field Separator (OFS) to a tab, so the file is read and written as tab-delimited.

split($1,ARR,":") splits just the first field on the colon and stores each element in the array ARR.

$1 = ARR[1] redefines the first field as the first array element (just the chromosome in this case).

print $0 prints the entire line (with the newly defined first field).
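Since the question mentions several sumstats files, the same awk command can be wrapped in a loop. This is only a sketch: the function name chr_only, the *.sumstats.txt glob and the .chr_only.txt output suffix are made-up names for illustration, and it assumes every file is tab-delimited like the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce column 1 (SNP) to the text before the first colon.
# Assumes tab-delimited input. The header has no colon in column 1, so it is unchanged.
chr_only() {
    awk 'BEGIN { FS = OFS = "\t" } { split($1, parts, ":"); $1 = parts[1]; print }' "$1"
}

# Loop over an assumed naming pattern and write new files next to the originals
# instead of overwriting them.
for f in *.sumstats.txt; do
    [ -e "$f" ] || continue    # nothing matched the glob: skip
    chr_only "$f" > "${f%.txt}.chr_only.txt"
done
```

Writing to a new file rather than editing in place keeps the original SNP identifiers around in case they are needed later.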

Many Thanks @shawn.w.foley

Your explanation will help me on future munging attempts.

(reply written 8 weeks ago by camerond)

8 weeks ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Something like this?

sed 's/^\([^:]*\):[^[:space:]]*/\1/'
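A sed substitution of this kind can be checked on a single line before running it on a whole file. The sketch below is a variant, not necessarily what was intended above: it uses the POSIX [[:space:]] class so it does not depend on sed understanding \t, and it leaves the original delimiter after column 1 untouched. The wrapper name snp_to_chr_sed is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: strip everything from the first colon to the end of
# column 1, keeping whatever whitespace delimiter follows the column.
# Lines whose first column has no colon (such as the header) are left alone.
snp_to_chr_sed() {
    sed 's/^\([^:[:space:]]*\):[^[:space:]]*/\1/' "$@"
}

# Quick check on one tab-delimited line read from stdin.
printf '10:100968448:T:AA\t0.3519\t10\n' | snp_to_chr_sed
```

With a file it would be called as snp_to_chr_sed sumstats.txt > sumstats.chr_only.txt, where the output name is again just an example.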