I’ve returned to bioinformatics after some time away from my academic training. I have genetics lab experience, and I'm solid on the IT/developer/data side, but I'm weak in pure bioinformatics.
I’m currently working on a project involving variant calling. The current step is filtering a BCF file into useful calls. I’ve been reading up on the GT:PL fields and getting a loose grasp on them. However, Google (and even the Biostars Handbook) has left me with some confusion.
First, I need to make sure I’m understanding terms correctly.
Speaking only of BAM and BCF/VCF files: what is the difference between a “sample” and a “read”? I understand that NGS uses dozens to hundreds of PCR-amplified “reads”. However, when I read about samples, it SEEMS that they are describing a single, matching forward and reverse strand alignment.
So, when I see a GT of 0/1 meaning a "sample" is heterozygous REF/ALT... is this actually referring to the SENSE strand being REF and the ANTISENSE strand being ALT?
FOR EXAMPLE: This line from my BCF file.
1 13868 . A G 197 . DP=138;VDB=0.297615;SGB=-0.693147;RPB=0.0995638;MQB=0.826721;MQSB=0.501179;BQB=0.722136;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=52,38,18,30;MQ=30 GT:PL 0/1:230,0,255
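To make the layout of that record easier to discuss, here is a minimal Python sketch that splits it into its columns. It assumes the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then one column per sample) and the bcftools DP4 layout (ref-forward, ref-reverse, alt-forward, alt-reverse read counts). The function name `parse_record` is my own, purely for illustration, and a real pipeline would more likely use an existing VCF library.

```python
# The record from above. Real VCF/BCF text output is tab-separated;
# splitting on any whitespace also handles the space-separated paste here.
RECORD = (
    "1 13868 . A G 197 . "
    "DP=138;VDB=0.297615;SGB=-0.693147;RPB=0.0995638;MQB=0.826721;"
    "MQSB=0.501179;BQB=0.722136;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;"
    "DP4=52,38,18,30;MQ=30 GT:PL 0/1:230,0,255"
)


def parse_record(line):
    """Split one VCF data line into fixed columns, INFO, and per-sample fields."""
    chrom, pos, _id, ref, alt, qual, flt, info, fmt, *samples = line.split()
    # INFO is a ';'-separated list of KEY=VALUE pairs (flag-only keys are skipped).
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    # FORMAT names the ':'-separated fields found in every sample column.
    keys = fmt.split(":")
    sample_maps = [dict(zip(keys, s.split(":"))) for s in samples]
    return {
        "chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt,
        "qual": float(qual), "info": info_map, "samples": sample_maps,
    }


rec = parse_record(RECORD)
# Assumed DP4 order: ref-forward, ref-reverse, alt-forward, alt-reverse.
ref_fwd, ref_rev, alt_fwd, alt_rev = map(int, rec["info"]["DP4"].split(","))

print("sample columns:", len(rec["samples"]))
print("GT:", rec["samples"][0]["GT"], "PL:", rec["samples"][0]["PL"])
print("REF reads fwd/rev:", ref_fwd, ref_rev)
print("ALT reads fwd/rev:", alt_fwd, alt_rev)
```

Under those assumptions, each column after FORMAT corresponds to one sample, and DP4 reports how many reads supporting each allele aligned to each strand.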
I'm assuming this is one "sample", since it is the only line in the file that references position 13868 on Chromosome 1.
I also assume this one "sample" is made up of dozens (perhaps hundreds) of "reads".
Because this is a consensus of "reads", it ends up referring to one "forward" strand and one "reverse" strand, worked out by aligning all of these reads.
And, based on its consensus of reads, it has arrived at the odds "230,0,255", meaning it is statistically likely (from all these hundreds of reads) that the forward strand matches the reference "A" and the reverse strand carries the SNP "G".
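For anyone checking the arithmetic behind "230,0,255", here is a rough Python sketch of how PL values map back to likelihoods. It assumes the VCF definition of PL as Phred-scaled genotype likelihoods, listed for a biallelic site in the order 0/0, 0/1, 1/1 and normalized so the most likely genotype has PL 0. The helper name `pl_to_relative_likelihoods` is hypothetical, and the numbers it returns are relative likelihoods (equivalent to a flat prior), not true posterior probabilities.

```python
def pl_to_relative_likelihoods(pl_values):
    """Convert Phred-scaled likelihoods (PL) to likelihoods that sum to 1."""
    # Phred scale: PL = -10 * log10(L), so L = 10 ** (-PL / 10).
    raw = [10 ** (-pl / 10) for pl in pl_values]
    total = sum(raw)
    return [r / total for r in raw]


# Assumed genotype order for one REF (A) and one ALT (G) allele.
GENOTYPES = ["0/0 (A/A)", "0/1 (A/G)", "1/1 (G/G)"]

pl = [int(x) for x in "230,0,255".split(",")]
likelihoods = pl_to_relative_likelihoods(pl)

for genotype, pl_value, likelihood in zip(GENOTYPES, pl, likelihoods):
    print(f"{genotype}: PL={pl_value} relative likelihood={likelihood:.3g}")

best = GENOTYPES[likelihoods.index(max(likelihoods))]
print("most likely genotype:", best)
```

Read this way, a lower PL means a more likely genotype, and the three numbers describe genotypes of the sample at this position.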