Question: How best to parse samtools mpileup results?
0
gravatar for yarmda
7 weeks ago by
yarmda0
yarmda0 wrote:

I am trying to characterize the depth of coverage for SNPs/INDELs/other variants in my BAM/SAM file. Using samtools mpileup generates a nice VCF that gives me information on this, but it's not quite complete. Entering the "-t AD" flag gives me the depth of the alternate alleles compared to the reference. That is what I want, though the actual depth is found in the column after the flag is indicated and it is colon delimited. It is readable, but not terribly concise.

However, I am trying to check specific positions along the genome for specific alternate alleles. Is there a convenient way to parse the VCF for the AD flag values by position and allele without complicated and format-specific if/else statements?

I can use bedtools intersect or a dozen other methods to reduce the file to specific positions, already, but I'm hoping there's a method that doesn't require piping a handful of tools together.

EDIT:

It would be straightforward to write an if statement of the kind:

if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

However, over entire genomes and many different variant calls, this won't be so trivial. I have a file that contains a list of positions along the genome and alleles that I'm specifically looking for. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.

mpileup vcftools samtools vcf • 355 views
ADD COMMENTlink modified 7 weeks ago by Pierre Lindenbaum89k • written 7 weeks ago by yarmda0
1

BBMap's variant-caller callvariants.sh) will produce a vcf file with the allele depth always in the exact same position, which is relatively easy to parse with no if statements... a typical line is:

Escherichia_coli    547694  A   G   46.830  PASS    SN=0;STA=547693;STO=547694;TYP=SUB;R1P=11;R1M=15;R2P=18;R2M=13;PPC=57;LS=8037;MQS=2423;MQM=44;BQS=1870;BQM=37;EDS=2196;EDM=70;IDS=56235;IDM=992;COV=33;MCOV=17;HMP=2;DP=57;AF=1.0000;DP4=-13,-11,29,28  GT:DP:AD:AF 1:33:33:1.0000  1:24:24:1.0000

The total coverage here (DP) is 33 for the first sample, and allele depth (AD) is also 33 for the first sample. So, allele depth is the the 3rd colon-delimited field of the 9th tab-delimited field.

ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by Brian Bushnell9.8k

That doesn't look too much different from the pileup output. The greater difficulty I'm having isn't parsing out the AD feature, but doing so when comparing the alternate allele value at that point.

It would be straightforward to write an if statement of the kind:

if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

However, over entire genomes and many different variant calls, this won't be so trivial. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.

ADD REPLYlink written 7 weeks ago by yarmda0
3
gravatar for Pierre Lindenbaum
7 weeks ago by
France
Pierre Lindenbaum89k wrote:

It would be straightforward to write an if statement of the kind if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

using vcffilterjs https://github.com/lindenb/jvarkit/wiki/VCFFilterJS (will add NOT_TARGET_ORGANISM_STRAIN in the column FILTER)

 java -jar dist/vcffilterjs.jar -e 'function accept(v) {var foundT=false;var alts= v.getAlternateAlleles(); for(i=0;i< alts.size();++i) {if(alts.get(i).getDisplayString().equals("T")) {foundT=true;break;}} if(!foundT) return false; var g0=v.getGenotype(0); if(!g0.hasAD()) return false; var AD=g0.getAD(); for(var i in AD) if(AD[i]>20) return true; return false;} accept(variant);' -F NOT_TARGET_ORGANISM_STRAIN input.vcf
ADD COMMENTlink written 7 weeks ago by Pierre Lindenbaum89k

This looks really close to what I want. Is there a way I can feed the expression a file containing a series of positions with alleles of interest and a minimum AD?

ADD REPLYlink written 7 weeks ago by yarmda0
2

you can create a loop with position/AD-treshold , select the portion of VCF using becftools and pipe it into java -jar dist/vcffilterjs.jar

ADD REPLYlink written 7 weeks ago by Pierre Lindenbaum89k

That's good thinking. I considered the loop, but hadn't realized how I could avoid loading the entire vcf each iteration.

ADD REPLYlink written 7 weeks ago by yarmda0

Can you describe how to pass a shell variable to the javascript expression?

ADD REPLYlink written 7 weeks ago by yarmda0

use double quotes instead of simple quotes

ADD REPLYlink written 7 weeks ago by Pierre Lindenbaum89k

I tried that to no success. Below I define the variable allele to be T, the same as the original expression. I have replaced "T" in getDisplayString().equals("T") with "$allele", but now everything is considered false. Have I made a mistake?

allele="T"
echo $allele
java -jar dist/vcffilterjs.jar -e 'function accept(v) {var foundT=false;var alts= v.getAlternateAlleles(); for(i=0;i< alts.size();++i) {if(alts.get(i).getDisplayString().equals("$allele")) {foundT=true;break;}} if(!foundT) return false; var g0=v.getGenotype(0); if(!g0.hasAD()) return false; var AD=g0.getAD(); for(var i in AD) if(AD[i]>20) return true; return false;} accept(variant);' -F NOT_TARGET_ORGANISM_STRAIN input.vcf
ADD REPLYlink modified 7 weeks ago • written 7 weeks ago by yarmda0
1

use double quotes instead of simple quotes

... -e "function accept(v.... .equals(\"${allele}\")...."
ADD REPLYlink written 7 weeks ago by Pierre Lindenbaum89k

This has been exceedingly helpful. Now, if I wanted to print the AD and the DP instead of the INFO column and instead of filtering on the AD - would that be possible?

ADD REPLYlink written 7 weeks ago by yarmda0

How does this work behind the scenes? I'm finding the order of the PL:DP:AD flags is reversed when the FILTER lists "PASS" compared to when it is "NOT_TARGET_ORGANISM_STRAIN"

ADD REPLYlink written 7 weeks ago by yarmda0
1

The writing is handled by the java hts library. (htsjdk)

ADD REPLYlink written 7 weeks ago by Pierre Lindenbaum89k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 475 users visited in the last hour