How best to parse samtools mpileup results?
1
1
Entering edit mode
6.1 years ago
yarmda ▴ 40

I am trying to characterize the depth of coverage for SNPs/INDELs/other variants in my BAM/SAM file. Using samtools mpileup generates a nice VCF that gives me information on this, but it's not quite complete. Entering the "-t AD" flag gives me the depth of the alternate alleles compared to the reference. That is what I want, though the actual depth is found in the column after the flag is indicated and it is colon delimited. It is readable, but not terribly concise.

However, I am trying to check specific positions along the genome for specific alternate alleles. Is there a convenient way to parse the VCF for the AD flag values by position and allele without complicated and format-specific if/else statements?

I can use bedtools intersect or a dozen other methods to reduce the file to specific positions, already, but I'm hoping there's a method that doesn't require piping a handful of tools together.

EDIT:

It would be straightforward to write an if statement of the kind:

if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

However, over entire genomes and many different variant calls, this won't be so trivial. I have a file that contains a list of positions along the genome and alleles that I'm specifically looking for. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.

samtools mpileup vcf vcftools • 4.1k views
1
Entering edit mode

BBMap's variant-caller callvariants.sh) will produce a vcf file with the allele depth always in the exact same position, which is relatively easy to parse with no if statements... a typical line is:

Escherichia_coli    547694  A   G   46.830  PASS    SN=0;STA=547693;STO=547694;TYP=SUB;R1P=11;R1M=15;R2P=18;R2M=13;PPC=57;LS=8037;MQS=2423;MQM=44;BQS=1870;BQM=37;EDS=2196;EDM=70;IDS=56235;IDM=992;COV=33;MCOV=17;HMP=2;DP=57;AF=1.0000;DP4=-13,-11,29,28  GT:DP:AD:AF 1:33:33:1.0000  1:24:24:1.0000

The total coverage here (DP) is 33 for the first sample, and allele depth (AD) is also 33 for the first sample. So, allele depth is the the 3rd colon-delimited field of the 9th tab-delimited field.

0
Entering edit mode

That doesn't look too much different from the pileup output. The greater difficulty I'm having isn't parsing out the AD feature, but doing so when comparing the alternate allele value at that point.

It would be straightforward to write an if statement of the kind:

if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

However, over entire genomes and many different variant calls, this won't be so trivial. I guess I should better word my question to ask for some software that performs that level of analysis. I need to compare allele identity at a given position to a file containing many positions and alleles of interest and then return the coverage of those alleles in the sample that match the ones in the comparison file. After that, I need to review the depth to determine the significance of the variant - which is arbitrary.

3
Entering edit mode
6.1 years ago

It would be straightforward to write an if statement of the kind if ALT column is T then return AD and if AD is > 20 return "likely to be target organism strain"

using vcffilterjs https://github.com/lindenb/jvarkit/wiki/VCFFilterJS (will add NOT_TARGET_ORGANISM_STRAIN in the column FILTER)

java -jar dist/vcffilterjs.jar -e 'function accept(v) {var foundT=false;var alts= v.getAlternateAlleles(); for(i=0;i< alts.size();++i) {if(alts.get(i).getDisplayString().equals("T")) {foundT=true;break;}} if(!foundT) return false; var g0=v.getGenotype(0); if(!g0.hasAD()) return false; var AD=g0.getAD(); for(var i in AD) if(AD[i]>20) return true; return false;} accept(variant);' -F NOT_TARGET_ORGANISM_STRAIN input.vcf
0
Entering edit mode

This looks really close to what I want. Is there a way I can feed the expression a file containing a series of positions with alleles of interest and a minimum AD?

2
Entering edit mode

you can create a loop with position/AD-treshold , select the portion of VCF using becftools and pipe it into java -jar dist/vcffilterjs.jar

0
Entering edit mode

That's good thinking. I considered the loop, but hadn't realized how I could avoid loading the entire vcf each iteration.

0
Entering edit mode

Can you describe how to pass a shell variable to the javascript expression?

0
Entering edit mode

use double quotes instead of simple quotes

0
Entering edit mode

I tried that to no success. Below I define the variable allele to be T, the same as the original expression. I have replaced "T" in getDisplayString().equals("T") with "$allele", but now everything is considered false. Have I made a mistake? allele="T" echo$allele
java -jar dist/vcffilterjs.jar -e 'function accept(v) {var foundT=false;var alts= v.getAlternateAlleles(); for(i=0;i< alts.size();++i) {if(alts.get(i).getDisplayString().equals("$allele")) {foundT=true;break;}} if(!foundT) return false; var g0=v.getGenotype(0); if(!g0.hasAD()) return false; var AD=g0.getAD(); for(var i in AD) if(AD[i]>20) return true; return false;} accept(variant);' -F NOT_TARGET_ORGANISM_STRAIN input.vcf ADD REPLY 1 Entering edit mode use double quotes instead of simple quotes ... -e "function accept(v.... .equals(\"${allele}\")...."
0
Entering edit mode

This has been exceedingly helpful. Now, if I wanted to print the AD and the DP instead of the INFO column and instead of filtering on the AD - would that be possible?

0
Entering edit mode

How does this work behind the scenes? I'm finding the order of the PL:DP:AD flags is reversed when the FILTER lists "PASS" compared to when it is "NOT_TARGET_ORGANISM_STRAIN"

1
Entering edit mode

The writing is handled by the java hts library. (htsjdk)