Debugging a sed script to extract values from rows in .txt file
3.2 years ago

I am trying to extract out all the values for a specific quality filter MQRankSum. Someone has given a sed script showing how they did it.

Here is one row of my .txt file; the values are all located in column 8:

AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0


I am trying to extract only the values of MQRankSum. The sed script provided online was:

cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > MQRankSum.txt

When I used that sed command I extracted most of the MQRankSum values, but I was also left with rows of text that have no MQRankSum annotation:

0.000
AC=2;AF=1.00;AN=2;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=31.25;SOR=0.915
-0.254
0.377
1.943
AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303
-1.926
-14.951
-4.042
-7.347
-9.536
-3.781
0.637

I tried to debug the sed script but I am having trouble. I want to graph the MQRankSum values but cannot with the additional text rows. What is missing from the sed script that will allow only numbers to pass through to the final .txt file?

sed regex unix

with sed:

$ sed 's/.*;\(MQ[A-Za-z]\+\W\{2\}[0-9]\W[0-9]\{3\}\);.*/\1/g' test.txt
MQRankSum=-7.801


with awk:

$ awk -F ";" '{print $11}' test.txt
MQRankSum=-7.801


with cut:

$ cut -d ";" -f11 test.txt
MQRankSum=-7.801

with grep (MQRankSum is always followed by QD):

$ grep -Po 'MQR.*(?=;QD.*)' test.txt
MQRankSum=-7.801


.

$ cat test.txt
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0

I tried to extract values for ReadPosRankSum using the same sed command:

cut -f 8 | \
sed 's/^.*;ReadPosRankSum=\(\-\{0,1\}[0-9]\{1,\}.[0-9]*\);.*$/\1/' > ReadPosRankSum.txt


Here are a few lines of the .txt file showing that it extracted some but not all of the values, even though ReadPosRankSum is present:

0.574
1.757
3.098
3.074
0.922
1.242
1.698
0.256


You're overcomplicating this. Your values are all delimited by semi-colons, so you should use them.

3.2 years ago
Joe 19k

Is this all you want? FWIW, the column you claim to want is 11, not 8 (OP clarified)

Data:

$ cat string.txt
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL 0/1:246,99:345:99:2909,0

Isolate the column:

$ cat string.txt | cut -d ';' -f 11
MQRankSum=-7.801


To get just the value itself:

$ cat string.txt | cut -d ';' -f 11 | sed 's/^.*=//g'
-7.801

This should work for any column you want to extract, since every key ends in an = before its signed value. Just change -f 11 to suit. E.g., as you've also asked for ReadPosRankSum:

$ cat string.txt | cut -d ';' -f 13 | sed 's/^.*=//g'
-1.213


If any of these answers were sufficient, OP, be sure to select one or more as "Accepted" by clicking the check mark at the left of the answer itself. You can optionally upvote as many answers and comments as you like too.


The original VCF has 8 columns, and the 8th column ("INFO") holds this whole string together in a single column (one string per row):

AC=1;AF=0.500;AN=2;BaseQRankSum=-2.790;DP=300;ExcessHet=3.0103;FS=3.218;MLEAC=1;MLEAF=0.500;MQ=55.83;MQRankSum=-11.767;QD=15.75;ReadPosRankSum=2.193;SOR=0.890


Ah I see. That's fine, just use your cut -f 8 command and pipe it to my command above... e.g.

$ cat bigvcffile.txt | cut -f 8 | cut -d ';' -f 11 ...

Can you interpret the sed command you are suggesting so that I can understand it? Please :)

The sed command works because the result of

$ cat string.txt | cut -d ';' -f 11


Gives the full string between semi-colons:

MQRankSum=-7.801


The sed command then says: taking this string as input, substitute (s/) from the start of the string (^), encompassing any character (.) any number of times (*), until the literal = sign is met, then replace this with nothing (globally, though that's not really necessary in this case) (//g).

Maybe this visualisation will help:

String:                MQRankSum=-7.801
What sed sees:        ^.........=


You could equally do it the other way around and capture the decimal instead, but this seemed easier to me.
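As a quick check of the explanation above, this small sketch runs the same substitution on the example string. The variable names field and value are illustrative only and are not part of the original answer; the g flag is dropped because a single substitution per line is enough here.

```shell
# Sample field copied from the example row in this thread.
field='MQRankSum=-7.801'

# Delete everything from the start of the line up to and including the "=".
value=$(printf '%s\n' "$field" | sed 's/^.*=//')

# Print what is left after the substitution.
printf '%s\n' "$value"
```

Because .* is greedy, the match runs to the last = on the line, which is fine here since each field contains exactly one.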


You should say from the start that it is a VCF file. Modifying jrj.healey's answer:

cat file.vcf | cut -f 8 | cut -d ';' -f 11 | sed 's/^.*=//g'
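One caveat with picking field 11 by position: it assumes every row carries the same INFO keys in the same order, and the rows in the question that lack MQRankSum have fewer keys, so field 11 is a different annotation there. A possible alternative, offered only as a sketch, matches on the key name instead and uses sed -n with the p flag so that rows without the key print nothing. The function name extract_info_key is hypothetical, and the key is assumed to contain no regex metacharacters.

```shell
# extract_info_key KEY
# Reads VCF text on stdin and prints the value of KEY from the INFO column
# (column 8). Header lines and rows that lack KEY produce no output.
extract_info_key() {
    key=$1
    # 1. drop header lines starting with "#"
    # 2. keep only the tab-delimited INFO column
    # 3. prefix ";" so the key also matches when it is the first entry,
    #    then print only the lines where the substitution succeeded
    grep -v '^#' | cut -f 8 |
        sed -n "s/^/;/; s/.*;${key}=\([^;]*\).*/\1/p"
}
```

It could then be used as, for example, extract_info_key MQRankSum < file.vcf > MQRankSum.txt, and the same call with ReadPosRankSum for the other annotation.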


Here is a row of the original VCF showing that the INFO column is the 8th column:

unitig_2993_pilon   30545   .   G   T   2880.77 .   AC=1;AF=0.500;AN=2;BaseQRankSum=-0.181;DP=350;ExcessHet=3.0103;FS=134.905;MLEAC=1;MLEAF=0.500;MQ=50.03;MQRankSum=-7.801;QD=8.35;ReadPosRankSum=-1.213;SOR=4.021 GT:AD:DP:GQ:PL  0/1:246,99:345:99:2909,0,8546


I would go with VCF parsing tools: bgzip the VCF, index it with tabix, then query it with bcftools:

$ bcftools query -f '%MQRankSum\n' test.vcf.gz