Questions In Output Of Varscan 2.3.2
Entering edit mode
8.5 years ago
rahilsethi • 0


I have some questions in the output of mpileup2snp and mpileup2indel. For substitutions obtained from mpileup2snp when viewed in VCF format, whether they have genotype 0/1 (HET=1) or genotype 1/1 (HOM=1) all of them have only allele mentioned in ALT column. When I pulled out the %age distribution of bases in there subsitution sites they do not correspond the the cutoff frequency of 0.75 to be called as homozygote. For example, in the following variant:

Sample1 chr7    142460462       .       C       T       .       PASS     ADP=6;WT=0;HET=1;HOM=0;NC=0     GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR    0/1:2:33:6:1:5

To obtain the above variant the --min-freq-for-hom value taken was default (i.e. 0.75). When I looked at the allele distribution for the above putative variant position using samtools mpileup, it came out to be the following:

%A        %T        %G        %C        %Ins        %Del
0        75.75757576        0        24.24242424        0        0

Here %T is > 75% so it should have been called Homozygote but VarScan calls it Heterozygote mentions only one of the ALT allele, instead of both. The mpileup output for this position is:

chr7    142460462       N       33
tttttt$ttccctTT+2ACttctttctttctttctctt  IIIIIIIIH9GIIIIIGIIIHICIHIIIHIGII

Another example is:

chr7_gl000195_random    1       .       G       A       .       PASS   

Base % distribution is:

%A        %T        %G        %C        %Ins        %Del
100        0        0        0        0        0

and mpileup is:

chr7_gl000195_random    1       N       5

which clearly shows homozygotic substitution, unlike what is shown in VarScan output (HET=1; 0/1 genotype)

Regarding Indels:

VarScan VCF format output:

chr1    121107133       .       T       +T      .       PASS

%base distribution:

%A        %T        %G        %C        %Ins        %Del
0        0        0        99.18864097        0.811359026        0

the FREQ value in the above VarScan output is 100% which contradicts the %base distribution (%Ins = 0.811359026). The value of --min-var-freq that I took for the above run is 0.05 and yet it not filtered from the output. The mpileup for the above is:

chr1    121107133       N       483

chr1    121107134       N       489

chr1    121107135       N       489


Why there is discrepancy in the above cases? For %Ins (Insertion) and %Del (Deletion) I calculated for 1 base after the InDel position showed in VCF format because in standard VCF format In-Dels are shown in 1 base before the insertion.

I took the following parameter values to run VarScan mpileup2snp and mpileup2ins:

samtools mpileup -f ref bamfile.bam > output_mpileup

java -Xmx6g -jar ~/VarScan_2_3_2/VarScan.v2.3.2.jar
mpileup2snp output_mpileup \
--min-coverage 4 \
--min-reads2 4 \
--min-avg-qual 20 \
--min-var-freq 0.05 \
--p-value 0.05 \
--output-vcf 1 > output_varscan_snp.vcf

java -Xmx6g -jar ~/VarScan_2_3_2/VarScan.v2.3.2.jar
mpileup2indel output_mpileup \
--min-coverage 4 \
--min-reads2 4 \
--min-avg-qual 20 \
--min-var-freq 0.05 \
--p-value 0.05 \
--output-vcf 1 > output_varscan_indel.vcf

Thanks, Rahil

snp • 6.8k views
Entering edit mode
8.4 years ago
dankoboldt ▴ 140


Thank you for these questions, and for providing the details I need to help answer them! First of all, I see that some of your pileup examples have "N" as the reference base. This is usually due to not providing a reference (-f) to SAMtools mpileup, and it causes lots of difficulties for accurately counting reads. If you did provide a reference base when running mpileup for VarScan, but didn't here, please let me know (and provide those mpileups) as it will change things.

Regarding Question 1 (chr7 142460462) VarScan's read counts differ slightly from SAMtools; it sees 8 reads (23.5%) supporting C and 25 reads (73.5%) supporting T, which is why it makes a heterozygous call.

Regarding Question 2 (chr7gl000195random) The issue is caused by (1) the lack of a reference base and (2) the presence of indels. Because the reference base is N, VarScan sees two things in each read: a substitution (A) and an insertion (2, 4, or 15 bases). So it's double-counting the reads when computing the denominator for VAF. I will look into whether or not I can change the code to address such situations without breaking other parts of read counting / VAF calculation.

Regarding Question 3 (chr1 121107133) This one is a little puzzling, but I think is caused by the N as the reference. Because that happens, the true reference base (T) is seen as a substitution. The VAF for each observed variant allele is always computed as supportingreads / (referencereads + supporting_reads). In the case of the insertion, it therefore only sees 4 insertion-supporting reads and 0 reference-supporting reads, so it calls a homozygous insertion.

Entering edit mode

Hi Dan,

Thanks for the reply. Regarding your concerns:

"If you did provide a reference base when running mpileup for VarScan, but didn't here, please let me know" I did provide the reference sequence for mpileup running in default settings as explained in VarScan manual. Here I showed mpileup output without reference so it is easy to count bases instead of dealing with dots and commas. Otherwise mpileup output in above examples should be the same as used in VarScan. Regarding my follow-up questions:

Follow up on Question 1 (chr7 142460462): I still don't get it how VarScan is calculating those percentages. Can you please elaborate? According to mipleup there are 33 total reads (after passing through default filter) mapping at this position. So 8/33 for C = 0.2424 (24.24%) and remaining 25/33 for T = 0.7575 (75.75%). Moreover the %ages you showed 23.5 and 73.5 do not add up to 100%

Regarding Question 2 and 3: Please explain this if reference base was provided


Login before adding your answer.

Traffic: 2398 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6