Strand information in VCF
10.4 years ago
My mpileup output looks like this.

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    xyzzy.W1g7rI9gs2.bam
X    533    .    C    G    25    .    DP=42;VDB=0.0033;AF1=0.5;AC1=1;DP4=20,0,18,0;MQ=20;FQ=26.8;PV4=1,7.1e-22,1,1    GT:PL:GQ    0/1:55,0,60:57
X    537    .    C    T    25    .    DP=44;VDB=0.0042;AF1=0.5;AC1=1;DP4=23,0,20,0;MQ=20;FQ=26.6;PV4=1,4.3e-20,1,0.28    GT:PL:GQ    0/1:55,0,59:57


Is it possible to get strand information? Do VCF/BCF files provide strand information?

I don't think strand information is relevant in a variant format: the alleles of a variant should be given on the forward (5'-3') strand, the same direction as the reference sequence. The opposite-strand sequence follows from base pairing. I don't think any variant that results in imperfect base pairing is viable.
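To illustrate the base-pairing point, here is a minimal Python sketch that rewrites forward-strand alleles as they would read on the opposite strand. The helper name `reverse_complement` is only an illustration, not something taken from bcftools or any VCF library.

```python
# Hypothetical helper: translate forward-strand alleles to the opposite strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq: str) -> str:
    """Return the sequence as it reads 5'-3' on the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]

# The C>G call at X:533 in the question, expressed on the reverse strand.
ref, alt = "C", "G"
print(reverse_complement(ref), reverse_complement(alt))  # G C
```

For single-base alleles this is just the complement; the reversal only matters for indels and other multi-base alleles.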

"Should be given": I am not sure what level of generality you are referring to, but alleles are sometimes reported on both the forward and the reverse strand (e.g. Comadran et al. 2012). The VCF format itself does not seem to specify this, according to http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

In the case of GATK you are right: "Note that REF and ALT are always given on the forward strand." (from http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk)

However, in my opinion, this doesn't mean that it holds for all data from all sources.
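If the question is about the strand of the reads supporting each allele (rather than the strand the alleles are written on), the DP4 tag already present in the INFO column of the question seems relevant. In samtools/bcftools output, DP4 holds the number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases. A minimal Python sketch for pulling it out of an INFO string follows; `parse_info` and `strand_counts` are hypothetical helper names, not part of bcftools.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a key -> value dict (flags map to '')."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields

def strand_counts(info: str):
    """Return per-strand base counts from DP4, or None if DP4 is absent."""
    dp4 = parse_info(info).get("DP4")
    if dp4 is None:
        return None
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    return {"ref_fwd": ref_fwd, "ref_rev": ref_rev,
            "alt_fwd": alt_fwd, "alt_rev": alt_rev}

# INFO column of the first record in the question (X:533).
info = "DP=42;VDB=0.0033;AF1=0.5;AC1=1;DP4=20,0,18,0;MQ=20;FQ=26.8;PV4=1,7.1e-22,1,1"
print(strand_counts(info))
# {'ref_fwd': 20, 'ref_rev': 0, 'alt_fwd': 18, 'alt_rev': 0}
```

For this record, all bases counted in DP4 come from forward-strand reads, for both the reference and the alternate allele.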

I don't think you can get the strand information. However, the VCF spec says that the GT field can be used to specify phasing:

GT genotype, encoded as alleles values separated by either of "/" or "|". The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, "." must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:

/ : genotype unphased
| : genotype phased
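As a rough illustration of the GT encoding quoted above, here is a small Python sketch that splits a GT string into allele indices and a phased/unphased flag. `parse_gt` is a hypothetical name, and the sketch only covers the cases the quote describes (diploid, haploid, missing alleles).

```python
import re

def parse_gt(gt: str):
    """Split a GT string into (allele indices, phased flag).

    Missing alleles (".") are returned as None.
    """
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    phased = "|" in gt  # "|" means phased, "/" means unphased
    return alleles, phased

print(parse_gt("0/1"))  # ([0, 1], False)
print(parse_gt("1|0"))  # ([1, 0], True)
print(parse_gt("./."))  # ([None, None], False)
```

Both records in the question have GT 0/1, i.e. unphased heterozygous calls.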


That said, I don't know which tools make use of this phasing information.

Thank you so much @Pierre. I will see whether any tool handles it.