Question

Debugging a sed script to extract values from rows in .txt file

0

Entering edit mode

5.5 years ago

kristina.mahan ▴ 170

I am trying to extract out all the values for a specific quality filter MQRankSum. Someone has given a sed script showing how they did it.

Here is one row of my .txt file all located in column 8:

AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

I am trying to extract out the values only of MQRankSum which. The sed script provided online was:

cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > MQRankSum.txt

When I used that sed command I mostly extracted the values for MQRankSum but also ended left with rows of text that was missing a notation for MQRankSum:

0.000
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915
-0.254
0.377
1.943
AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303
-1.926
-14.951
-4.042
-7.347
-9.536
-3.781
0.637

I tried to debug the sed script but I am having trouble. I want to graph the MQRankSum values but cannot with the additional text values. What is missing from the sed script that will allow only numbers to pass through to the final .txt file?

unix sed • 2.0k views

ADD COMMENT • link updated 6 months ago by Ram 43k • written 5.5 years ago by kristina.mahan ▴ 170

2

Entering edit mode

with sed:

$ sed 's/.*;\(MQ[A-Za-z]\+\W\{2\}[0-9]\W[0-9]\{3\}\);.*/\1/g' test.txt 
MQRankSum=-7.801

with awk:

$ awk -F ";" '{print $11}' test.txt 
MQRankSum=-7.801

with cut:

$ cut -d ";" -f11 test.txt 
MQRankSum=-7.801

with grep (MQRankSum is always followed by QD):

$ grep -Po 'MQR.*(?=;QD.*)' test.txt 
MQRankSum=-7.801

.

$ cat test.txt 
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

Entering edit mode

I tried to extract values for ReadPosRankSum using same sed command:

cut -f 8 | \
sed 's/^.*;ReadPosRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > ReadPosRankSum.txt

Here are a few lines of the .txt file showing it extracted some but not all of the value even though ReadPosRankSum is present:

0.574
1.757
3.098
AC=1;AF=0.500;AN=2;BaseQRankSum=0.308;DP=274;ExcessHet=3.0103;FS=31.396;MLEAC=1;MLEAF=0.500;MQ=58.45;MQRankSum=-16.410;QD=0.85;ReadPosRankSum=-3.793;SOR=3.537
AC=1;AF=0.500;AN=2;BaseQRankSum=1.851;DP=300;ExcessHet=3.0103;FS=40.009;MLEAC=1;MLEAF=0.500;MQ=58.04;MQRankSum=-14.850;QD=2.70;ReadPosRankSum=-2.540;SOR=3.531
3.074
AC=1;AF=0.500;AN=2;BaseQRankSum=-2.909;DP=400;ExcessHet=3.0103;FS=55.917;MLEAC=1;MLEAF=0.500;MQ=56.63;MQRankSum=-13.338;QD=9.33;ReadPosRankSum=-0.199;SOR=1.422
0.922
1.242
1.698
0.256

ADD REPLY • link 5.5 years ago by kristina.mahan ▴ 170

2

Entering edit mode

You're overcomplicating this. Your values are all delimited by semi-colons, you should use them.

ADD REPLY • link 5.5 years ago by Joe 21k

score 4 · Accepted Answer · 2018-10-11

4

Entering edit mode

5.5 years ago

Joe 21k

Is this all you want? ~~FWIW, the column you claim to want is 11, not 8~~ (OP clarified)

Data:

$ cat string.txt
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

Isolate the column:

$ cat string.txt | cut -d ';' -f 11
MQRankSum=-7.801

To get just the value itself:

$ cat string.txt | cut -d ';' -f 11 | sed 's/^.*=//g'
-7.801

This should work for any column you want to extract, since they all end in an = before the signed values. Just change -f 11 to suit.

E.g., as you've also asked for ReadPosRankSum:

$ cat string.txt | cut -d ';' -f 13| sed 's/^.*=//gi'
-1.213

ADD COMMENT • link 5.5 years ago by Joe 21k

1

Entering edit mode

If any of these answers were suffcient OP, be sure to select one or more as "Accepted" by clicking the check mark at the left of the answer itself. You can optionally upvote as many answers and comments as you like too.

ADD REPLY • link 5.5 years ago by Joe 21k

0

Entering edit mode

The original vcf has 8 columns and in the 8th column- "info" is has this string all together in same column (multiple rows):

AC=1;AF=0.500;AN=2;BaseQRankSum=-2.790;DP=300;ExcessHet=3.0103;FS=3.218;MLEAC=1;MLEAF=0.500;MQ=55.83;MQRankSum=-11.767;QD=15.75;ReadPosRankSum=2.193;SOR=0.890
AC=1;AF=0.500;AN=2;BaseQRankSum=1.257;DP=331;ExcessHet=3.0103;FS=38.264;MLEAC=1;MLEAF=0.500;MQ=54.96;MQRankSum=-15.246;QD=9.22;ReadPosRankSum=-1.003;SOR=1.559
AC=1;AF=0.500;AN=2;BaseQRankSum=0.766;DP=299;ExcessHet=3.0103;FS=75.704;MLEAC=1;MLEAF=0.500;MQ=55.12;MQRankSum=-16.302;QD=5.78;ReadPosRankSum=-0.123;SOR=2.776
AC=1;AF=0.500;AN=2;BaseQRankSum=3.407;DP=288;ExcessHet=3.0103;FS=112.157;MLEAC=1;MLEAF=0.500;MQ=56.57;MQRankSum=-16.680;QD=3.76;ReadPosRankSum=0.701;SOR=5.548
AC=1;AF=0.500;AN=2;BaseQRankSum=2.296;DP=268;ExcessHet=3.0103;FS=92.461;MLEAC=1;MLEAF=0.500;MQ=56.88;MQRankSum=-15.783;QD=2.25;ReadPosRankSum=-0.610;SOR=5.891
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.613;DP=293;ExcessHet=3.0103;FS=40.946;MLEAC=1;MLEAF=0.500;MQ=57.97;MQRankSum=-15.989;QD=0.38;ReadPosRankSum=1.561;SOR=3.723
AC=1;AF=0.500;AN=2;BaseQRankSum=2.168;DP=318;ExcessHet=3.0103;FS=94.957;MLEAC=1;MLEAF=0.500;MQ=56.24;MQRankSum=-16.658;QD=3.77;ReadPosRankSum=-2.989;SOR=3.777
AC=1;AF=0.500;AN=2;BaseQRankSum=2.111;DP=309;ExcessHet=3.0103;FS=138.322;MLEAC=1;MLEAF=0.500;MQ=55.58;MQRankSum=-16.086;QD=5.71;ReadPosRankSum=2.439;SOR=5.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.400;DP=321;ExcessHet=3.0103;FS=114.671;MLEAC=1;MLEAF=0.500;MQ=54.93;MQRankSum=-15.105;QD=11.08;ReadPosRankSum=0.060;SOR=3.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.599;DP=338;ExcessHet=3.0103;FS=67.185;MLEAC=1;MLEAF=0.500;MQ=55.90;MQRankSum=-12.540;QD=12.96;ReadPosRankSum=-1.176;SOR=1.882
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.887;DP=362;ExcessHet=3.0103;FS=13.619;MLEAC=1;MLEAF=0.500;MQ=57.49;MQRankSum=-8.685;QD=12.36;ReadPosRankSum=0.814;SOR=0.306
AC=1;AF=0.500;AN=2;BaseQRankSum=3.380;DP=447;ExcessHet=3.0103;FS=13.227;MLEAC=1;MLEAF=0.500;MQ=59.20;MQRankSum=-4.250;QD=14.93;ReadPosRankS

ADD REPLY • link 5.5 years ago by kristina.mahan ▴ 170

1

Entering edit mode

Ah I see. That's fine, just use your cut -f 8 command and pipe it to my command above... e.g.

$ cat bigvcffile.txt | cut -f 8 | cut -d ';' -f 11 ...

ADD REPLY • link 5.5 years ago by Joe 21k

1

Entering edit mode

can you interpret the sed command you are suggesting so that I can understand it? Please :)

ADD REPLY • link 5.5 years ago by kristina.mahan ▴ 170

1

Entering edit mode

The sed command works because the result of

$ cat string.txt | cut -d ';' -f 11

Gives the full string between semi-colons:

MQRankSum=-7.801

The sed command then says, taking this string as input substitute (s/) from the start of the string (^), encompassing any character (.) any number of times (*), until the literal = sign is met, then replace this with nothing (globally, though thats not really necessary in this case) (//g).

Maybe this visualisation will help:

String:                MQRankSum=-7.801
What sed sees:        ^.........=

You could equally do it the other way around and capture the decimal instead, but this seemed easier to me.

ADD REPLY • link 5.5 years ago by Joe 21k

1

Entering edit mode

You should say from the start it is a vcf file. Modifying jrj.healey answer:

cat file.vcf | cut -f 8 | cut -d ';' -f 11 | sed 's/^.*=//g'

ADD REPLY • link 5.5 years ago by h.mon 35k

0

Entering edit mode

here is a few rows of the original vcf showing that the info column as the 8th column:

unitig_2993_pilon   30545   .   G   T   2880.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0,8546
unitig_2993_pilon   30643   .   CA  C   5424.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.053;DP=432;ExcessHet=3.0103;FS=8.204;MLEAC=1;MLEAF=0.500;MQ=56.61;MQRankSum=-11.833;QD=13.01;ReadPosRankSum=2.192;SOR=0.926   GT:AD:DP:GQ:PL  0/1:232,185:417:99:5462,0,7701
unitig_2993_pilon   30644   .   AG  A   7051.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.974;DP=431;ExcessHet=3.0103;FS=9.401;MLEAC=1;MLEAF=0.500;MQ=56.68;MQRankSum=11.770;QD=16.91;ReadPosRankSum=-2.205;SOR=0.660  GT:AD:DP:GQ:PL  0/1:186,231:417:99:7089,0,6022
unitig_2993_pilon   30709   .   AG  A   7972.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.370;DP=479;ExcessHet=3.0103;FS=1.108;MLEAC=1;MLEAF=0.500;MQ=57.40;MQRankSum=11.870;QD=17.56;ReadPosRankSum=-0.052;SOR=0.756  GT:AD:DP:GQ:PL  0/1:199,255:454:99:8010,0,6428

ADD REPLY • link 5.5 years ago by kristina.mahan ▴ 170

2

Entering edit mode

I would go with vcf parsing tools like: bgzip VCF and index vcf with tabix, with bcftools:

$ bcftools query -f '%MQRankSum\n' test.vcf.gz

ADD REPLY • link 5.5 years ago by cpad0112 21k