I am observing strange discrepancies between the information present in the VCF created by Pindel and the internal file format. Here is what I did:
- I downloaded Pindel from GitHub two days ago with the command: git clone https://github.com/genome/pindel.git
- I ran Pindel on a sample BAM file, using the human GRCh37 g1k_v37 decoy genome sequence as the reference.
- I noticed that the VCF file lacks the support information, score, and quality measures, so I decided to recover them by merging the VCF file with the corresponding lines from the internal format. I extracted those lines with: grep ChrID internal_file
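For anyone who wants to reproduce the merge, here is a minimal Python sketch of how the lines returned by the grep could be parsed into a chromosome and breakpoint pair. The field layout in the regular expression is inferred only from the two internal-format lines quoted in this post, so treat it as an assumption; other Pindel versions or event types may use a different layout. The names PINDEL_LINE and parse_pindel_line are my own, not part of Pindel.

```python
import re

# Layout inferred from the internal-format lines shown in this post (assumption).
PINDEL_LINE = re.compile(
    r'^(?P<index>\d+)\s+(?P<sv_type>\S+)\s+(?P<sv_len>\d+)\s+'
    r'NT\s+(?P<nt_len>\d+)\s+"(?P<nt_seq>[^"]*)"\s+'
    r'ChrID\s+(?P<chrom>\S+)\s+'
    r'BP\s+(?P<bp_start>\d+)\s+(?P<bp_end>\d+)\s+'
    r'BP_range\s+(?P<range_start>\d+)\s+(?P<range_end>\d+)\s+'
    r'Supports\s+(?P<supports>\d+)'
)

INT_FIELDS = ("index", "sv_len", "nt_len", "bp_start", "bp_end",
              "range_start", "range_end", "supports")


def parse_pindel_line(line):
    """Return a dict of fields for one 'ChrID' line, or None if it does not match."""
    match = PINDEL_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


# The deletion line quoted later in this post, used as sample input.
sample = (
    '3305 D 96 NT 96 "GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAG'
    'AGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA" ChrID 1 BP 10290620 10290717 '
    'BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 '
    'NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1'
)
record = parse_pindel_line(sample)
print(record["chrom"], record["bp_start"], record["bp_end"], record["supports"])
```

The merge itself then needs a rule for mapping (chrom, bp_start, bp_end) onto the VCF POS and END, which is exactly where the trouble described next begins.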
However, not all the coordinates matched. For one particular case I had in the VCF:
1 10290621 . ATAGCTGGGATTACAGGTGTGTGCCACCACACCTGGTTAATTTTTGTATTTTTAATAGAGACGGGGTTTCACCGTGTTGGCTAGGCTGGTCTTGAT GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA . PASS END=10290715;HOMLEN=0;SVLEN=-96;SVTYPE=RPL;NTLEN=96 GT:AD 0/0:0,1
and the internal format:
3305 D 96 NT 96 "GTACTTGGGATTACTGGCGTACGCCACCACGCCCAGCTAATTTTTGTATTTTTAGTAGAGACGGGGTTTCACCATGTCAACCAGGCTGGTCTCGAA" ChrID 1 BP 10290620 10290717 BP_range 10290620 10290717 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1
- Notice that the VCF gives the variant position as 10290621, while the internal file says 10290620. I checked the UCSC Genome Browser and confirmed that the REF sequence is 96 bp long, starting at 10290621 and ending at 10290716.
So now I have the following discrepancies for (begin, end): reference (10290621, 10290716), VCF (10290621, 10290715), internal format (10290620, 10290717).
I repeated the exercise with a line where the start position matches:
1 10289908 . GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC GGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGCTATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACAGAGTGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAAAAGAAAATTAGGGGCCAGACGTGGTGGCTCACACCTATAATCCCAGC . PASS END=10290003;HOMLEN=0;SVLEN=95;SVTYPE=DUP:TANDEM;NTLEN=95 GT:AD 0/0:0,1
168 TD 95 NT 95 "TATTCAGGAGGCTGAGGCAGGAGAATCACTTGAACCCAGGAGGTGGAGGTTGCAGTGAGCTGAGATCGCACCACTGCACTCCAGCCTGGGTCACA" ChrID 1 BP 10289908 10290004 BP_range 10289908 10290004 Supports 1 1 + 0 0 - 1 1 S1 2 SUM_MS 99 1 NumSupSamples 1 1 pFDA_simTruth_76x_0.4_FEMALE 0 0 0 0 1 1
and in this case the (begin, end) coordinates are: reference (10289908, 10290003), VCF (10289908, 10290003), and internal format (10289908, 10290004).
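To make the arithmetic explicit, here is a small Python sketch that checks both records against two conventions. It assumes that VCF POS is 1-based, that a precise END should equal POS + len(REF) - 1, and that "flanking" means the bases immediately before and after the REF span; the last of these is my own guess at what the internal format intends, not something taken from the Pindel documentation. The function name spans is hypothetical.

```python
def spans(pos, ref_len):
    """Return ((first, last), (left_flank, right_flank)) for a 1-based POS."""
    first = pos
    last = pos + ref_len - 1          # last base covered by REF, inclusive
    return (first, last), (first - 1, last + 1)


# Values copied from the two records in this post; both REF alleles are 96 bp.
cases = {
    "RPL": {"pos": 10290621, "ref_len": 96,
            "vcf": (10290621, 10290715), "internal": (10290620, 10290717)},
    "DUP:TANDEM": {"pos": 10289908, "ref_len": 96,
                   "vcf": (10289908, 10290003), "internal": (10289908, 10290004)},
}

for name, case in cases.items():
    inclusive, flanking = spans(case["pos"], case["ref_len"])
    print(name,
          "| VCF equals inclusive span:", case["vcf"] == inclusive,
          "| internal equals flanking pair:", case["internal"] == flanking)
# RPL | VCF equals inclusive span: False | internal equals flanking pair: True
# DUP:TANDEM | VCF equals inclusive span: True | internal equals flanking pair: False
```

So neither convention holds for both records at once, which is the inconsistency I am asking about.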
- I cannot make sense of it.
- In the first case the VCF END is one base short of the true end of REF, and the internal format coordinates seem to be the positions flanking the event on both sides.
- In the second case the VCF coordinates are correct, but the internal format coordinates are not flanking on both sides: the start equals the VCF position and only the end is one base past the event.
The user manual does not explain any of this, so I am at a loss. Any help is appreciated.