(Skip this section if you're in a hurry!)
This query is about the rules/conventions for annotating the positions of indel variants in genotyping data.
The query is phrased in terms of the field (or column) names used by Illumina in the manifest files it provides for its genotyping chips. Still, I hope that this question can be answered even by people who are not particularly well-versed in Illumina's genotyping chip manifest files. More specifically, I hope that (a) most of Illumina's field names (e.g. SourceSeq, Chr) are sufficiently self-explanatory (when there's a chance they may not be, I will supply definitions); and (b) Illumina's position annotations for indels follow rules/conventions that are generally accepted in bioinformatics.
(Whatever "data" or "metadata" appears in this question is fully made up.)
Suppose that the manifest file record for a probe in an Illumina genotyping chip has a SourceSeq field featuring the following pattern
...where the "..." indicate that the sequences flanking the variant's locus in the SourceSeq field are (typically) longer.
From this information (and, more specifically, from the hyphen before the forward slash), we can tell that this variant is an indel. It is not clear from this alone, however, whether it is an insertion or a deletion (relative to the reference genome).
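As a minimal sketch of the check just described, the snippet below splits a SourceSeq-style string into its flanks and alleles and flags it as an indel when exactly one allele is a hyphen. The example string, its flanking bases, and the function name `parse_source_seq` are made up for illustration; this is not Illumina's own parser.

```python
import re

# Hypothetical SourceSeq excerpt; the flanking bases are made up.
EXAMPLE_SOURCE_SEQ = "ACTG[-/TACC]GTCA"

# left flank, "[", allele A, "/", allele B, "]", right flank
SOURCE_SEQ_RE = re.compile(
    r"^([ACGTN]*)\[([ACGTN]+|-)/([ACGTN]+|-)\]([ACGTN]*)$",
    re.IGNORECASE,
)


def parse_source_seq(source_seq):
    """Split a SourceSeq-style string into flanks and alleles."""
    match = SOURCE_SEQ_RE.match(source_seq)
    if match is None:
        raise ValueError(f"unrecognized SourceSeq pattern: {source_seq!r}")
    left, allele_a, allele_b, right = match.groups()
    # An indel has a hyphen on exactly one side of the slash.
    is_indel = (allele_a == "-") != (allele_b == "-")
    indel_seq = None
    if is_indel:
        indel_seq = allele_b if allele_a == "-" else allele_a
    return {
        "left_flank": left,
        "right_flank": right,
        "is_indel": is_indel,
        "indel_seq": indel_seq,
    }


print(parse_source_seq(EXAMPLE_SOURCE_SEQ))
```

Note that this only identifies the variant as an indel; as stated above, it cannot tell an insertion from a deletion without looking at the reference genome.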
Furthermore, assume that the MapInfo field (i.e. the position of the variant in its chromosome) for this probe's record has value 1000000 (just an easy-to-remember number, by way of example). The value of the Chr field is irrelevant for this question, but for the sake of concreteness, let's say it's 1.
I will use the term "indel sequence" to refer to the sequence TACC shown between the forward slash and the right square bracket in the SourceSeq excerpt shown above [sequence (0)].
I want to consider two cases, corresponding, respectively, to whether the variant is a deletion or an insertion (both relative to the reference genome).
First, suppose the variant is a deletion. This means that, somewhere in the vicinity of position 1000000 on chromosome 1, the reference genome contains the indel sequence, and, more precisely, that it looks like this
My first question is: which of the bases shown above [in sequence (1)] is position 1000000? I imagine it is either the G right before the indel sequence, or the T at the beginning of the indel sequence, but I don't know which it is.
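To make the two candidates concrete, here is a rough sketch of how one could test them empirically for the deletion case, assuming a stretch of reference sequence around the locus is available. The reference window, its start coordinate, and the function name `deletion_candidates` are all hypothetical. The sketch does not say which convention Illumina uses; it only lists the coordinates that each convention would imply, so they can be compared with MapInfo.

```python
def deletion_candidates(ref_window, window_start, indel_seq):
    """List candidate 1-based positions for a deletion of indel_seq.

    ref_window   -- reference bases around the locus (made up here)
    window_start -- 1-based coordinate of ref_window[0]
    """
    hits = []
    i = ref_window.find(indel_seq)
    while i != -1:
        first = window_start + i  # coordinate of the first deleted base
        hits.append({"base_before": first - 1, "first_indel_base": first})
        i = ref_window.find(indel_seq, i + 1)
    return hits


# Hypothetical reference window whose first base sits at position 999996.
for hit in deletion_candidates("ACTGTACCGTCA", 999996, "TACC"):
    print(hit)
```

With real data, if MapInfo matches `base_before` for a given probe, the position refers to the base preceding the indel sequence; if it matches `first_indel_base`, it refers to the first base of the indel sequence. More than one hit means the indel sequence occurs several times in the window, so the comparison is ambiguous for that probe.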
Second, suppose now that the variant is an insertion. This means that, in the vicinity of position 1000000 on chromosome 1, the reference genome does not contain the indel sequence, and more precisely, that it looks like this
My second question is: which of the bases shown above [in sequence (2)] is position 1000000? I imagine it is one of the two G's shown but I don't know which it is.
Finally, my third (and most important!) question is: are these rules/conventions for annotating the position of indel variants documented anywhere? Or is this bioinformatics folklore?