How To Make Samtools Mpileup Produce Genotype Information At Every Base?
1
4
Entering edit mode
8.7 years ago
komal.rathi ★ 4.0k

Hi everyone,

I have a bam file and a snps file for chr1 (hg19) which contains the chromosome number & position for a list of known snps.

$head chr1.snps chr1 1211292 chr1 2069172 chr1 2069681 chr1 2513216 chr1 2553624  Now, I want to compute the genotype information (just like it is given in a vcf format) at all the positions given in my snps file, but I do not want to "call" any snps because these are already known variants. And I think VCF format is a good standard for getting genotype information so I tried using: samtools mpileup -u -l chr1.snps -f hg19.fa sample1_filter_sort.bam | bcftools view - > test1.vcf  So here I feed my .snps file and .bam file to samtools mpileup and create an uncompressed .bcf file which I then pipe into bcftools to get a .vcf file. Here I got 245 entries in my .vcf file, so it found 245/513 snps. But as it is not calling any snps, the output is a bit weird: $head test1.vcf

chr1    1211292    .    C    X    0    .    DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000    PL    0,0,0
chr1    2069172    .    C    X    0    .    DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000    PL    0,0,0
chr1    2069681    .    C    X    0    .    DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=0.000000,0.000000,0.000000,0.000000    PL    0,0,0
chr1    2513216    .    C    T,X    0    .    DP=1;I16=0,0,1,0,0,0,34,1156,0,0,39,1521,0,0,4,16;QS=0.000000,1.000000,0.000000,0.000000    PL    34,3,0,34,3,34
chr1    2553624    .    T    X    0    .    DP=0;I16=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;QS=-nan,0.000000,0.000000,0.000000    PL    0,0,0


In another test I use options to call the variants:

samtools mpileup -u -l chr1.snps -f hg19.fa sample1_filter_sort.bam | bcftools view -v -c -g -l chr1.snps - > test2.vcf


Here, I feed my .snps and .bam file to samtools mpileup and create an uncompressed .bcf file which I then pipe into bcftools to get a .vcf file but this time I am calling the variants using -vcg options. Here, I got 16 entries in my .vcf file so it "called" 16/513 snps. The output here is just like I want it to be:

 \$head test2.vcf

chr1    2513216    .    C    T    5.46    .    DP=1;QS=0.000000,1.000000,0.000000,0.000000;AF1=1;AC1=2;DP4=0,0,1,0;MQ=39;FQ=-30    GT:PL:GQ    1/1:34,3,0:3
chr1    29519778    .    T    G    187    .    DP=8;QS=0.000000,1.000000,0.000000,0.000000;VDB=4.580661e-02;AF1=1;AC1=2;DP4=0,0,4,3;MQ=40;FQ=-48    GT:PL:GQ    1/1:220,21,0:39
chr1    41204569    .    C    T    222    .    DP=91;QS=0.000000,1.000000,0.000000,0.000000;VDB=4.337502e-01;AF1=1;AC1=2;DP4=0,0,47,34;MQ=40;FQ=-271    GT:PL:GQ    1/1:255,244,0:99
chr1    42880516    .    T    C    222    .    DP=11;QS=0.000000,1.000000,0.000000,0.000000;VDB=1.273245e-01;AF1=1;AC1=2;DP4=0,0,5,5;MQ=40;FQ=-57    GT:PL:GQ    1/1:255,30,0:57
chr1    55319902    .    A    G    222    .    DP=20;QS=0.000000,1.000000,0.000000,0.000000;VDB=1.803282e-01;AF1=1;AC1=2;DP4=0,0,12,7;MQ=40;FQ=-84    GT:PL:GQ    1/1:255,57,0:99


So, essentially you can see that there is no genotype information for the snps detected in test1 but there is genotype information in the last column of test2 (because we used variant calling). What I want is that the program should detect the 245 snps just like test1 (I dont want it to call it because I already have the list) but should report the output in a format like test2. How can that be done?

samtools mpileup bcftools • 8.8k views
1
Entering edit mode
8.7 years ago

The question is whether your bam file supports or not the SNPs that you expect to be there.

As it happens the answer is rarely that you find 100%. It all depends on coverage, alignment and the quality of sample.

0
Entering edit mode

Thanks you are right, the question is that but I was also doubtful about my approach. So nothing is wrong with my approach of extracting genotype information for a list of snps from a bam file?

0
Entering edit mode

yes, it should work just fine, most of the time we call snps not even knowing where they are, if you knew them to be true getting the genotype is "easy"

0
Entering edit mode

I have edited my question so that it makes sense a bit

1
Entering edit mode

I have changed the title of your post your question does not really depend on the existance of the SNP file.