**140**wrote:

I have an artificial ref file against which I have mapped artificial reads to understand some things about VCF, calls etc. ...

The reference file that looks like this:

>myseq_contig1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNCCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTT TGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCT CAGACCACTTCGGACCGAGATGCTTCGCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

This ref sequence is 270 in length. The first 70 sites of the reference are "N". Then there's a sequence (140 * [ACGT]). Then again 60 "N".

This is the SAM file of the mapping:

left_1 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGCATCTATTGCATTCGATCGGAACGATCGATCGACAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0 left_2 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGAATCTATTGCATTCGATTGGAACGATCGATCGAGAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0 read1_101 0 myseq_contig1 71 60 100M * 0 0 CCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0 read21_121 0 myseq_contig1 91 60 100M * 0 0 ACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0 read41_141 0 myseq_contig1 111 60 100M * 0 0 GACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0 right1 0 myseq_contig1 171 60 100M * 0 0 TCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCATCCGAGTATCTCGGCAAAGTTATAGGCGTGGGGGCTACGCTCCAAGGCGTTATTTACTTAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:40 AS:i:40 XS:i:0

I have 3 reads that form a contig at the first non-"N" position in the reference, their read names are

read1_101, read21_121 and read41_141.

The reads left_1 and left_2 both start at the first "N". They also both perfectly match the first 30 non-"N"-sites in the reference, i.e. they both span the "N"-region at the beginning and then reach into the non-"N"-sequence-part.

The last read (right1) starts at position 171, which is 40 positions before the trailing "N"-region. It reaches until the end, i.e. until the last "N" in the reference.

**Now I am confused about the output of this command:**

samtools view -h mapping.sam | samtools mpileup -vf ref.fasta - | bcftools call -Mc -Oz - | bcftools view | less -S

The output looks like this (cut-out because my post is too long):

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT - myseq_contig1 211 . N C 6.98265 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9912 GT:PL 1/1:36,3,0 myseq_contig1 212 . N C 4.77219 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9923 GT:PL 0/1:33,3,0 myseq_contig1 213 . N G 3.54557 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9935 GT:PL 0/1:31,3,0 myseq_contig1 214 . N A 3.01618 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9944 GT:PL 0/1:30,3,0 myseq_contig1 215 . N . 3.53256 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9956 GT:PL 0/0:29 myseq_contig1 216 . N . 4.11733 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.997 GT:PL 0/0:28

Now I see the following genotypes (last column, column header = "-") at the respective reference-sites:

211: 1/1

212: 0/1

213: 0/1

214: 0/1

211 is where the "N" region starts at the end of the reference. So here he calls the variant 1.

But from position 215 onwards he starts to call 0/0 (which is reference which is "N") until the end.

But positions 215 until end vs positions 211-214 share the same condition: reference is "N" and there is one read (namely "right1") that covers the position. Why does he make different decisions when the conditions are the same?