Debugging a sed script to extract values from rows in .txt file
1
0
Entering edit mode
3.2 years ago

I am trying to extract out all the values for a specific quality filter MQRankSum. Someone has given a sed script showing how they did it.

Here is one row of my .txt file all located in column 8:

AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

I am trying to extract out the values only of MQRankSum which. The sed script provided online was:

cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > MQRankSum.txt

When I used that sed command I mostly extracted the values for MQRankSum but also ended left with rows of text that was missing a notation for MQRankSum:

0.000
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915
-0.254
0.377
1.943
AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303
-1.926
-14.951
-4.042
-7.347
-9.536
-3.781
0.637

I tried to debug the sed script but I am having trouble. I want to graph the MQRankSum values but cannot with the additional text values. What is missing from the sed script that will allow only numbers to pass through to the final .txt file?

sed regex unix • 951 views
ADD COMMENT
2
Entering edit mode

with sed:

$ sed 's/.*;\(MQ[A-Za-z]\+\W\{2\}[0-9]\W[0-9]\{3\}\);.*/\1/g' test.txt 
MQRankSum=-7.801

with awk:

$ awk -F ";" '{print $11}' test.txt 
MQRankSum=-7.801

with cut:

$ cut -d ";" -f11 test.txt 
MQRankSum=-7.801

with grep (MQRankSum is always followed by QD):

$ grep -Po 'MQR.*(?=;QD.*)' test.txt 
MQRankSum=-7.801

.

$ cat test.txt 
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0
ADD REPLY
0
Entering edit mode

I tried to extract values for ReadPosRankSum using same sed command:

cut -f 8 | \
sed 's/^.*;ReadPosRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > ReadPosRankSum.txt

Here are a few lines of the .txt file showing it extracted some but not all of the value even though ReadPosRankSum is present:

0.574
1.757
3.098
AC=1;AF=0.500;AN=2;BaseQRankSum=0.308;DP=274;ExcessHet=3.0103;FS=31.396;MLEAC=1;MLEAF=0.500;MQ=58.45;MQRankSum=-16.410;QD=0.85;ReadPosRankSum=-3.793;SOR=3.537
AC=1;AF=0.500;AN=2;BaseQRankSum=1.851;DP=300;ExcessHet=3.0103;FS=40.009;MLEAC=1;MLEAF=0.500;MQ=58.04;MQRankSum=-14.850;QD=2.70;ReadPosRankSum=-2.540;SOR=3.531
3.074
AC=1;AF=0.500;AN=2;BaseQRankSum=-2.909;DP=400;ExcessHet=3.0103;FS=55.917;MLEAC=1;MLEAF=0.500;MQ=56.63;MQRankSum=-13.338;QD=9.33;ReadPosRankSum=-0.199;SOR=1.422
0.922
1.242
1.698
0.256
ADD REPLY
2
Entering edit mode

You're overcomplicating this. Your values are all delimited by semi-colons, you should use them.

ADD REPLY
4
Entering edit mode
3.2 years ago
Joe 19k

Is this all you want? FWIW, the column you claim to want is 11, not 8 (OP clarified)

Data:

$ cat string.txt
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0

Isolate the column:

$ cat string.txt | cut -d ';' -f 11
MQRankSum=-7.801

To get just the value itself:

$ cat string.txt | cut -d ';' -f 11 | sed 's/^.*=//g'
-7.801

This should work for any column you want to extract, since they all end in an = before the signed values. Just change -f 11 to suit.

E.g., as you've also asked for ReadPosRankSum:

$ cat string.txt | cut -d ';' -f 13| sed 's/^.*=//gi'
-1.213
ADD COMMENT
1
Entering edit mode

If any of these answers were suffcient OP, be sure to select one or more as "Accepted" by clicking the check mark at the left of the answer itself. You can optionally upvote as many answers and comments as you like too.

ADD REPLY
0
Entering edit mode

The original vcf has 8 columns and in the 8th column- "info" is has this string all together in same column (multiple rows):

AC=1;AF=0.500;AN=2;BaseQRankSum=-2.790;DP=300;ExcessHet=3.0103;FS=3.218;MLEAC=1;MLEAF=0.500;MQ=55.83;MQRankSum=-11.767;QD=15.75;ReadPosRankSum=2.193;SOR=0.890
AC=1;AF=0.500;AN=2;BaseQRankSum=1.257;DP=331;ExcessHet=3.0103;FS=38.264;MLEAC=1;MLEAF=0.500;MQ=54.96;MQRankSum=-15.246;QD=9.22;ReadPosRankSum=-1.003;SOR=1.559
AC=1;AF=0.500;AN=2;BaseQRankSum=0.766;DP=299;ExcessHet=3.0103;FS=75.704;MLEAC=1;MLEAF=0.500;MQ=55.12;MQRankSum=-16.302;QD=5.78;ReadPosRankSum=-0.123;SOR=2.776
AC=1;AF=0.500;AN=2;BaseQRankSum=3.407;DP=288;ExcessHet=3.0103;FS=112.157;MLEAC=1;MLEAF=0.500;MQ=56.57;MQRankSum=-16.680;QD=3.76;ReadPosRankSum=0.701;SOR=5.548
AC=1;AF=0.500;AN=2;BaseQRankSum=2.296;DP=268;ExcessHet=3.0103;FS=92.461;MLEAC=1;MLEAF=0.500;MQ=56.88;MQRankSum=-15.783;QD=2.25;ReadPosRankSum=-0.610;SOR=5.891
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.613;DP=293;ExcessHet=3.0103;FS=40.946;MLEAC=1;MLEAF=0.500;MQ=57.97;MQRankSum=-15.989;QD=0.38;ReadPosRankSum=1.561;SOR=3.723
AC=1;AF=0.500;AN=2;BaseQRankSum=2.168;DP=318;ExcessHet=3.0103;FS=94.957;MLEAC=1;MLEAF=0.500;MQ=56.24;MQRankSum=-16.658;QD=3.77;ReadPosRankSum=-2.989;SOR=3.777
AC=1;AF=0.500;AN=2;BaseQRankSum=2.111;DP=309;ExcessHet=3.0103;FS=138.322;MLEAC=1;MLEAF=0.500;MQ=55.58;MQRankSum=-16.086;QD=5.71;ReadPosRankSum=2.439;SOR=5.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-1.400;DP=321;ExcessHet=3.0103;FS=114.671;MLEAC=1;MLEAF=0.500;MQ=54.93;MQRankSum=-15.105;QD=11.08;ReadPosRankSum=0.060;SOR=3.316
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.599;DP=338;ExcessHet=3.0103;FS=67.185;MLEAC=1;MLEAF=0.500;MQ=55.90;MQRankSum=-12.540;QD=12.96;ReadPosRankSum=-1.176;SOR=1.882
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.887;DP=362;ExcessHet=3.0103;FS=13.619;MLEAC=1;MLEAF=0.500;MQ=57.49;MQRankSum=-8.685;QD=12.36;ReadPosRankSum=0.814;SOR=0.306
AC=1;AF=0.500;AN=2;BaseQRankSum=3.380;DP=447;ExcessHet=3.0103;FS=13.227;MLEAC=1;MLEAF=0.500;MQ=59.20;MQRankSum=-4.250;QD=14.93;ReadPosRankS
ADD REPLY
1
Entering edit mode

Ah I see. That's fine, just use your cut -f 8 command and pipe it to my command above... e.g.

$ cat bigvcffile.txt | cut -f 8 | cut -d ';' -f 11 ...
ADD REPLY
1
Entering edit mode

can you interpret the sed command you are suggesting so that I can understand it? Please :)

ADD REPLY
1
Entering edit mode

The sed command works because the result of

$ cat string.txt | cut -d ';' -f 11

Gives the full string between semi-colons:

MQRankSum=-7.801

The sed command then says, taking this string as input substitute (s/) from the start of the string (^), encompassing any character (.) any number of times (*), until the literal = sign is met, then replace this with nothing (globally, though thats not really necessary in this case) (//g).

Maybe this visualisation will help:

String:                MQRankSum=-7.801
What sed sees:        ^.........=

You could equally do it the other way around and capture the decimal instead, but this seemed easier to me.

ADD REPLY
1
Entering edit mode

You should say from the start it is a vcf file. Modifying jrj.healey answer:

cat file.vcf | cut -f 8 | cut -d ';' -f 11 | sed 's/^.*=//g'
ADD REPLY
0
Entering edit mode

here is a few rows of the original vcf showing that the info column as the 8th column:

unitig_2993_pilon   30545   .   G   T   2880.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0,8546
unitig_2993_pilon   30643   .   CA  C   5424.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.053;DP=432;ExcessHet=3.0103;FS=8.204;MLEAC=1;MLEAF=0.500;MQ=56.61;MQRankSum=-11.833;QD=13.01;ReadPosRankSum=2.192;SOR=0.926   GT:AD:DP:GQ:PL  0/1:232,185:417:99:5462,0,7701
unitig_2993_pilon   30644   .   AG  A   7051.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-1.974;DP=431;ExcessHet=3.0103;FS=9.401;MLEAC=1;MLEAF=0.500;MQ=56.68;MQRankSum=11.770;QD=16.91;ReadPosRankSum=-2.205;SOR=0.660  GT:AD:DP:GQ:PL  0/1:186,231:417:99:7089,0,6022
unitig_2993_pilon   30709   .   AG  A   7972.73 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.370;DP=479;ExcessHet=3.0103;FS=1.108;MLEAC=1;MLEAF=0.500;MQ=57.40;MQRankSum=11.870;QD=17.56;ReadPosRankSum=-0.052;SOR=0.756  GT:AD:DP:GQ:PL  0/1:199,255:454:99:8010,0,6428
ADD REPLY
2
Entering edit mode

I would go with vcf parsing tools like: bgzip VCF and index vcf with tabix, with bcftools:

$ bcftools query -f '%MQRankSum\n' test.vcf.gz
ADD REPLY

Login before adding your answer.

Traffic: 2328 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6