Strand Information On VCF
10.5 years ago
Jirapong ▴ 20

My mpileup output looks like this.

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    xyzzy.W1g7rI9gs2.bam
X    533    .    C    G    25    .    DP=42;VDB=0.0033;AF1=0.5;AC1=1;DP4=20,0,18,0;MQ=20;FQ=26.8;PV4=1,7.1e-22,1,1    GT:PL:GQ    0/1:55,0,60:57
X    537    .    C    T    25    .    DP=44;VDB=0.0042;AF1=0.5;AC1=1;DP4=23,0,20,0;MQ=20;FQ=26.6;PV4=1,4.3e-20,1,0.28    GT:PL:GQ    0/1:55,0,59:57

Is it possible to get strand information from this? Do VCF/BCF files provide strand information?

vcf bcftools • 8.2k views

I don't think strand information is relevant in a variant format: the alleles for a variant should be given on the forward (leading, 5'-3') strand, in the same direction as the reference sequence. The opposite-strand sequence follows from base pairing. I don't think any variant that results in imperfect base pairing is viable.
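To illustrate the point that the opposite-strand alleles follow from base pairing, here is a minimal sketch in Python. The helper name `reverse_complement` is made up for this example and is not part of any VCF tool; it assumes the alleles are plain A/C/G/T/N strings given on the forward strand.

```python
# Hypothetical helper for illustration only; not part of samtools/bcftools.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(allele: str) -> str:
    """Return the allele as it would read on the opposite strand."""
    # Complement each base, then reverse so the result reads 5'-3'.
    return allele.translate(_COMPLEMENT)[::-1]


# REF and ALT of the first record in the question (X:533, C>G),
# assumed here to be reported on the forward strand.
ref, alt = "C", "G"
print(reverse_complement(ref), reverse_complement(alt))
```

For single-base alleles this is just the complement; the reversal only matters for multi-base alleles such as indels.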


"Should be given": I am not sure what level of generality you are referring to, but alleles are sometimes reported on both the forward and the reverse strand (e.g. Comadran et al. 2012). The VCF format itself does not seem to specify anything about this, according to http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

In the case of GATK you are right: "Note that REF and ALT are always given on the forward strand." (from http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk)

However, in my opinion, this doesn't mean that it holds for all data from all sources.

10.5 years ago

I don't think you can get the strand information. However, the VCF spec says that the GT field can be used to specify phasing:

GT: genotype, encoded as allele values separated by either "/" or "|". The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. For diploid calls, examples could be 0/1 or 1|0. For haploid calls (e.g. on Y, male X, or mitochondrion), only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, "." must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:

    / : genotype unphased
    | : genotype phased

However, I don't know which tools make use of this phasing property.
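As a rough sketch of the separator rule quoted above, the GT value of a sample column can be split by hand in Python. The function name `parse_gt` is made up for this example; this is not how any particular tool does it, only a direct reading of the quoted text.

```python
import re


def parse_gt(gt: str):
    """Split a GT value into allele indices and a phased flag.

    Missing alleles ('.') are returned as None. A haploid call has no
    separator, so it is reported here as unphased.
    """
    phased = "|" in gt
    alleles = [None if a == "." else int(a) for a in re.split(r"[/|]", gt)]
    return alleles, phased


# FORMAT and sample columns of the first record in the question.
fmt = "GT:PL:GQ".split(":")
sample = "0/1:55,0,60:57".split(":")
gt = dict(zip(fmt, sample))["GT"]

alleles, phased = parse_gt(gt)
print(alleles, phased)
```

For the records in the question the separator is "/", so the calls are unphased heterozygous genotypes.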


Thank you so much, @Pierre. I will see whether any tools handle it.
