Question: Debugging a sed script to extract values from rows in .txt file
0
gravatar for kristina.mahan
9 weeks ago by
kristina.mahan20 wrote:

I am trying to extract out all the values for a specific quality filter MQRankSum. Someone has given a sed script showing how they did it.

Here is one row of my .txt file all located in column 8:

AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

I am trying to extract out the values only of MQRankSum which. The sed script provided online was:

cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > MQRankSum.txt

When I used that sed command I mostly extracted the values for MQRankSum but also ended left with rows of text that was missing a notation for MQRankSum:

0.000
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915
-0.254
0.377
1.943
AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303
-1.926
-14.951
-4.042
-7.347
-9.536
-3.781
0.637

I tried to debug the sed script but I am having trouble. I want to graph the MQRankSum values but cannot with the additional text values. What is missing from the sed script that will allow only numbers to pass through to the final .txt file?

regex unix sed • 230 views
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by kristina.mahan20
2

with sed:

$ sed 's/.*;\(MQ[A-Za-z]\+\W\{2\}[0-9]\W[0-9]\{3\}\);.*/\1/g' test.txt 
MQRankSum=-7.801

with awk:

$ awk -F ";" '{print $11}' test.txt 
MQRankSum=-7.801

with cut:

$ cut -d ";" -f11 test.txt 
MQRankSum=-7.801

with grep (MQRankSum is always followed by QD):

$ grep -Po 'MQR.*(?=;QD.*)' test.txt 
MQRankSum=-7.801

.

$ cat test.txt 
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0
ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by cpad011210k

I tried to extract values for ReadPosRankSum using same sed command:

cut -f 8 | \
sed 's/^.*;ReadPosRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > ReadPosRankSum.txt

Here are a few lines of the .txt file showing it extracted some but not all of the value even though ReadPosRankSum is present:

0.574
1.757
3.098
AC=1;AF=0.500;AN=2;BaseQRankSum=0.308;DP=274;ExcessHet=3.0103;FS=31.396;MLEAC=1;MLEAF=0.500;MQ=58.45;MQRankSum=-16.410;QD=0.85;ReadPosRankSum=-3.793;SOR=3.537
AC=1;AF=0.500;AN=2;BaseQRankSum=1.851;DP=300;ExcessHet=3.0103;FS=40.009;MLEAC=1;MLEAF=0.500;MQ=58.04;MQRankSum=-14.850;QD=2.70;ReadPosRankSum=-2.540;SOR=3.531
3.074
AC=1;AF=0.500;AN=2;BaseQRankSum=-2.909;DP=400;ExcessHet=3.0103;FS=55.917;MLEAC=1;MLEAF=0.500;MQ=56.63;MQRankSum=-13.338;QD=9.33;ReadPosRankSum=-0.199;SOR=1.422
0.922
1.242
1.698
0.256
ADD REPLYlink written 9 weeks ago by kristina.mahan20
2

You're overcomplicating this. Your values are all delimited by semi-colons, you should use them.

ADD REPLYlink written 9 weeks ago by jrj.healey9.1k
4
gravatar for jrj.healey
9 weeks ago by
jrj.healey9.1k
United Kingdom
jrj.healey9.1k wrote:

Is this all you want? FWIW, the column you claim to want is 11, not 8 (OP clarified)

Data:

$ cat string.txt
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

Isolate the column:

$ cat string.txt | cut -d ';' -f 11
MQRankSum=-7.801

To get just the value itself:

$ cat string.txt | cut -d ';' -f 11 | sed 's/^.*=//g'
-7.801

This should work for any column you want to extract, since they all end in an = before the signed values. Just change -f 11 to suit.

E.g., as you've also asked for ReadPosRankSum:

$ cat string.txt | cut -d ';' -f 13| sed 's/^.*=//gi'
-1.213
ADD COMMENTlink modified 9 weeks ago • written 9 weeks ago by jrj.healey9.1k
1

If any of these answers were suffcient OP, be sure to select one or more as "Accepted" by clicking the check mark at the left of the answer itself. You can optionally upvote as many answers and comments as you like too.

ADD REPLYlink written 9 weeks ago by jrj.healey9.1k

The original vcf has 8 columns and in the 8th column- "info" is has this string all together in same column (multiple rows):

AC=1;AF=0.500;AN=2;BaseQRankSum=-2.790;DP=300;ExcessHet=3.0103;FS=3.218;MLEAC=1;MLEAF=0.500;MQ=55.83;MQRankSum=-11.767;QD=15.75;ReadPosRankSum=2.193;SOR=0.890
AC=1;AF=0.500;AN=2;BaseQRankSum=1.257;DP=331;ExcessHet=3.0103;FS=38.264;MLEAC=1;MLEAF=0.500;MQ=54.96;MQRankSum=-15.246;QD=9.22;ReadPosRankSum=-1.003;SOR=1.559
AC=1;AF=0.500;AN=2;BaseQRankSum=0.766;DP=299;ExcessHet=3.0103;FS=75.704;MLEAC=1;MLEAF=0.500;MQ=55.12;MQRankSum=-16.302;QD=5.78;ReadPosRankSum=-0.123;SOR=2.776
AC=1;AF=0.500;AN=2;BaseQRankSum=3.407;DP=288;ExcessHet=3.0103;FS=112.157;MLEAC=1;MLEAF=0.500;MQ=56.57;MQRankSum=-16.680;QD=3.76;ReadPosRankSum=0.701;SOR=5.548
AC=1;AF=0.500;AN=2;BaseQRankSum=2.296;DP=268;ExcessHet=3.0103;FS=92.461;MLEAC=1;MLEAF=0.500;MQ=56.88;MQRankSum=-15.783;QD=2.25;ReadPosRankSum=-0.610;SOR=5.891
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.613;DP=293;ExcessHet=3.0103;FS=40.946;MLEAC=1;MLEAF=0.500;MQ=57.97;MQRankSum=-15.989;QD=0.38;ReadPosRankSum=1.561;SOR=3.723
AC=1;AF=0.500;AN=2;BaseQRankSum=2.168;DP=318;ExcessHet=3.0103;FS=94.957;MLEAC=1;MLEAF=0.500;MQ=56.24;MQRankSum=-16.658;QD=3.77;ReadPosRankSum=-2.989;SOR=3.777
AC=1;AF=0.500;AN=2;BaseQRankSum=2.111;DP=309;ExcessHet=3.0103;FS=138.322;MLEAC=1;MLEAF=0.500;MQ=55.58;MQRankSum=-16.086;QD=5.71;ReadPosRankSum=2.439;SOR=5.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.400;DP=321;ExcessHet=3.0103;FS=114.671;MLEAC=1;MLEAF=0.500;MQ=54.93;MQRankSum=-15.105;QD=11.08;ReadPosRankSum=0.060;SOR=3.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.599;DP=338;ExcessHet=3.0103;FS=67.185;MLEAC=1;MLEAF=0.500;MQ=55.90;MQRankSum=-12.540;QD=12.96;ReadPosRankSum=-1.176;SOR=1.882
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.887;DP=362;ExcessHet=3.0103;FS=13.619;MLEAC=1;MLEAF=0.500;MQ=57.49;MQRankSum=-8.685;QD=12.36;ReadPosRankSum=0.814;SOR=0.306
AC=1;AF=0.500;AN=2;BaseQRankSum=3.380;DP=447;ExcessHet=3.0103;FS=13.227;MLEAC=1;MLEAF=0.500;MQ=59.20;MQRankSum=-4.250;QD=14.93;ReadPosRankS
ADD REPLYlink written 9 weeks ago by kristina.mahan20
1

Ah I see. That's fine, just use your cut -f 8 command and pipe it to my command above... e.g.

$ cat bigvcffile.txt | cut -f 8 | cut -d ';' -f 11 ...
ADD REPLYlink written 9 weeks ago by jrj.healey9.1k
1

can you interpret the sed command you are suggesting so that I can understand it? Please :)

ADD REPLYlink written 9 weeks ago by kristina.mahan20
1

The sed command works because the result of

$ cat string.txt | cut -d ';' -f 11

Gives the full string between semi-colons:

MQRankSum=-7.801

The sed command then says, taking this string as input substitute (s/) from the start of the string (^), encompassing any character (.) any number of times (*), until the literal = sign is met, then replace this with nothing (globally, though thats not really necessary in this case) (//g).

Maybe this visualisation will help:

String:                MQRankSum=-7.801
What sed sees:        ^.........=

You could equally do it the other way around and capture the decimal instead, but this seemed easier to me.

ADD REPLYlink modified 9 weeks ago • written 9 weeks ago by jrj.healey9.1k
1

You should say from the start it is a vcf file. Modifying jrj.healey answer:

cat file.vcf | cut -f 8 | cut -d ';' -f 11 | sed 's/^.*=//g'
ADD REPLYlink written 9 weeks ago by h.mon21k

here is a few rows of the original vcf showing that the info column as the 8th column:

unitig_2993_pilon   30545   .   G   T   2880.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0,8546
unitig_2993_pilon   30643   .   CA  C   5424.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.053;DP=432;ExcessHet=3.0103;FS=8.204;MLEAC=1;MLEAF=0.500;MQ=56.61;MQRankSum=-11.833;QD=13.01;ReadPosRankSum=2.192;SOR=0.926   GT:AD:DP:GQ:PL  0/1:232,185:417:99:5462,0,7701
unitig_2993_pilon   30644   .   AG  A   7051.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.974;DP=431;ExcessHet=3.0103;FS=9.401;MLEAC=1;MLEAF=0.500;MQ=56.68;MQRankSum=11.770;QD=16.91;ReadPosRankSum=-2.205;SOR=0.660  GT:AD:DP:GQ:PL  0/1:186,231:417:99:7089,0,6022
unitig_2993_pilon   30709   .   AG  A   7972.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.370;DP=479;ExcessHet=3.0103;FS=1.108;MLEAC=1;MLEAF=0.500;MQ=57.40;MQRankSum=11.870;QD=17.56;ReadPosRankSum=-0.052;SOR=0.756  GT:AD:DP:GQ:PL  0/1:199,255:454:99:8010,0,6428
ADD REPLYlink written 9 weeks ago by kristina.mahan20
2

I would go with vcf parsing tools like: bgzip VCF and index vcf with tabix, with bcftools:

$ bcftools query -f '%MQRankSum\n' test.vcf.gz
ADD REPLYlink written 9 weeks ago by cpad011210k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 581 users visited in the last hour