Question: Understanding terms, specifically relating to GT:PL in a VCF file
0
gravatar for rightmirem
18 months ago by
rightmirem60
rightmirem60 wrote:

I’ve returned to bioinformatics after some time away from my academic training. I have genetics lab experience, and I'm a solid IT/developer/data - but I'm weak in my pure bioinformatics.

I’m currently working on a project involving variant calling. The current step is filtering a BCF vile into useful calls. I’ve been reading up on the GT:PL and getting a loose grasp on it. However, Google (and even the Biostars Handbook) have left me with some confusion.

First, I need to make sure I’m understanding terms correctly.

Speaking only to BAM and BCF/VCF files…what is the difference between a “sample” and a “read”. I understand that NGS sequencing uses dozens to hundreds of PCR amplified ‘reads’. However, when I read about samples, it SEEMS that they are discussing a single, matching forward and reverse strand alignment.

So, when I see a PL of 0/1 meaning a "sample" is heterozygous REF/ALT...is this actually referring to the SENSE strand being REF and the ANTISENSE strand being ALT?

FOR EXAMPLE: This line from my BCF file.

1       13868   .       A       G       197     .       DP=138;VDB=0.297615;SGB=-0.693147;RPB=0.0995638;MQB=0.826721;MQSB=0.501179;BQB=0.722136;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=52,38,18,30;MQ=30    GT:PL 0/1:230,0,255
  • I'm assuming this is one "sample", since it is the only line in the file that references position 13868 on Chromosome 1.

  • I also assume this one "sample" is made up from dozens (perhaps hundreds) of "reads"

  • Because this is a consensus of "reads", it ends up referring to one "forward" strand and one "reverse" strand, figured out by aligning all these myriad of reads

  • And, based on its consensus of reads, it has come to a conclusion of odds "230,0,255" meaning it is statistically likely (from all these hundreds of reads) that the forward strand matches the reference "A", and the reverse strand is a SNP of "G"

ADD COMMENTlink modified 17 months ago by Kevin Blighe41k • written 18 months ago by rightmirem60
1

The sense and antisense strand are reverse complements of each other, they are essentially the same sequence. But a diploid organism has two copies of the whole genome, so each copy can have a different letter at a given point. A non-clonal mix of organisms can also show two (or more) different letters at a given point.

ADD REPLYlink written 17 months ago by swbarnes25.2k
6
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe41k
Kevin Blighe41k wrote:

Hi,

Sample

Well, a sample is a sample of DNA or cDNA that has been sequenced - it relates to a person, patient, or specific site biopsy of an individual (if dealing with humans). Typically, DNA is extracted from blood or a buccal swab biopsy and then this DNA is sequenced - this is a single sample. In cancer, it could be a tumour biopsy.

Sample prep

When a single DNA sample is being prepared for sequencing, it will typically be digested by enzymes, amplified by PCR, and then 'thrown' into the sequencer, or, to be more proper, it will be pipetted onto on a chip [Edit:] or onto a lane on a chip along with other samples in other lanes. In many situations, multiple samples are included together and they are distinguished later by software by unique sequence barcodes that are added during the preparation steps - this is called multiplexing.

In the sequencer / Reads

In the sequencer, each DNA fragment per sample is sequenced and, concurrently, read by a fluorescent detector (Illumina) or through the change in acidity as measured across a semiconductor (Ion Torrent, etc.). The sequencer will typically only sequence one strand, i.e., sense or antisense, and the DNA from both paternal and maternal chromosomes will be sequenced.

After the sequencing run has finished, we will have data in the form of fluorescent intensities or electrical signals for each DNA fragment from the initial sample. These are then decoded as base-calls through the use of a program called a base caller, which interprets the signals and infers the bases. Each DNA fragment will then have a single associated 'read' of bases, which is recorded in a FASTQ file typically. These are then obviously the reads. There will be 1 FASTQ per sample (or 1 FASTQ pair if mate-pair sequencing).

Analysis

Aligners and variant callers will process the FASTQ file(s) and produce a single VCF. If multiple samples were sequenced, we can merge all of the data into a single VCF and maintain sample-specific variant calls by adding extra columns to the VCF after the FORMAT column. Take a look here to understand better the VCF format: http://www.internationalgenome.org/wiki/Analysis/vcf4.0/

Encoding in the VCF

Each line in the VCF reports a variant that was encountered in the sample when comparing the sample's reads against the elected reference genome. The 0/1 is not the PL. It is the GT, i.e., genotype, and indicates the zygosity of the reported variant on whose line this appears in the VCF. 0/1 indicates a heterozygous call, meaning that both a statistically significant proportion of reference and variant bases were found at the position of interest. This means that one of the maternal or paternal chromosomes carried the variant base, whilst the other carried the reference base. 1/1 would indicate that both maternal and paternal carried the variant base. This does not relate to sense/antisense.

Phasing

If the GTs in the VCF have the pipe symbol instead of the forward slash, i.e., 1|0, 1|1, or 0|1, it means that we know which variants were called on the same paternal or maternal chromosome, i.e., the same haplotype. To get this level of information, faithfully, requires a special type of sample prep that I won't go into right now. To interpret this data, all variants reported on the left of the pipe symbol will be on the same haplotype, whilst all on the right will be on the another.


This is the best that I can do for now in terms of an answer. Others will be happy to provide more material.

Kevin

ADD COMMENTlink modified 17 months ago • written 17 months ago by Kevin Blighe41k
1

The sequencer will typically only sequence one strand, i.e., sense or antisense

Although it doesn't really influence the point you make I don't think this is fully correct - as paired-end sequencing is very common and will get you complementary evidence of both strands.

ADD REPLYlink written 17 months ago by WouterDeCoster38k
1

Yes, I was hoping for someone to add to that part, as I knew that it was only in part true and not applicable to all types of sequencing protocols. There is much diversity in these protocols, so, I tried to keep it as general as possible.

The aligner will determine which reads were sense or antisense, and be capable of aligning them to the same position. Sense- or antisense-specific information can be used in certain types of experiments, but it's not applicable to mention in this thread.

ADD REPLYlink written 17 months ago by Kevin Blighe41k
2
gravatar for WouterDeCoster
17 months ago by
Belgium
WouterDeCoster38k wrote:

My answer is only meant as a minor extension to the one of Kevin - since that one is already quite complete.

You seem to be confused about DNA strands. Remember that the DNA strands are complementary: an A on the sense strand is matched to a T on the antisense strand. As such your assumption about REF/ALT on either strand is wrong. Rather, this refers to the two chromosomes. In the case of a 1/0 genotype, you have one chromosome (e.g. maternal) with the reference allele and one chromosome (e.g. paternal) with the alternative allele. So far this is plain genetics.

A single nucleotide (chromosome 1, position 13868) is typically covered by multiple reads, each contributing to the evidence you have for that position. Roughly the number of maternal reads is about the same as the number of paternal reads. Reads are not equally distributed across the genome due to various reasons, including GC bias, sample prep biases and the locus itself. A read is an independent* observation of the genome of your sample. Assume you have 20 reads covering that nucleotide, then you will have ~10 reads derived from the maternal chromosome with the A allele, and ~10 reads from the paternal chromosome with the G allele. So the evidence here is pretty good that you have a heterozygous A/G SNP.

ADD COMMENTlink written 17 months ago by WouterDeCoster38k

Thanks Wouter, I was hoping for someone else to add to the discussion.

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Thanks! And that was exactly the root of my question, actually. If the GT was sense/antisense, that meant that the 0/1 reads would have to be erroneous and otherwise mismatched strands - which didn't really make sense to me...but if it were correct, then my understanding of the VCF file was going to dramatically change :D I was concerned this meant any 0/1 or 1/0 data had to be discarded.

ADD REPLYlink written 17 months ago by rightmirem60

To follow the train of thought, there is an extension of the standard method that allows the sequencing of both sense and antisense. This is then used to corroborate variant calls, which should match on both of these strands for every genuine variant that's encountered.

ADD REPLYlink written 17 months ago by Kevin Blighe41k
0
gravatar for rightmirem
17 months ago by
rightmirem60
rightmirem60 wrote:

Im not sure if this is best asked here, as a continuation of thought - or in a separate thread (please feel free to advise).

So...what does AC (allele count), AN (allele number), and AF (allele frequency) mean - specifically as they relate to a VCF file. I assume thyre based on the number of reads?

I ask, for example, because the above sample line has an AC=1, but an AN=2.

ADD COMMENTlink modified 17 months ago • written 17 months ago by rightmirem60

This could probably have been a new question, but not to worry. Note that the VCF header should have a definition of all of these parameters - just search [in the header]. The VCF specification also explains them:

  • AC allele count in genotypes, for each ALT allele, in the same order as listed
  • AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
  • AN total number of alleles in called genotypes

So they don't directly relate to the reads, but obviously the reads are important in deriving these numbers!

AC will generally relate to the zygosity, and it's counting the number of variant alleles. In most situations, if the call is heterozygous (0/1), then AC will be 1; whereas, if it's homozygous (1/1), AC will be 2.

AN is just the total number of alleles, i.e, reference + variant alleles. This will generally corroborate with the ploidy of the organism (but deviate in things like cancer).

ADD REPLYlink modified 17 months ago • written 17 months ago by Kevin Blighe41k

Thanks for the answer. I've managed to get my head around AC and AN, but what is AF? The header information (as well as a lot of what I find online) is not really clear. So...

"AF allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes"

So, (for clarification) if at position 12345 a diploid organism has a GT=1/1 ... this means both chromosomes are variant in this position...making their AC=2, and AN=2.

But I have VCF files from single individuals (sample is one diploid organism where AC can never be greater than 2) - and my amalgamated VCF file of 70-single-individuals (where the AC could be 140). But filtering any of these (including the amalgamated VCF) with an AF > 1 results in no data.

ADD REPLYlink written 17 months ago by rightmirem60

AF is generally equal to AC / AN. As AC is limited by AN (it can never be greater), AF can neither be greater than 1.0, i.e., 100%. This is why there's no data when you search for variants with AF>1.

So, in your example:

So, (for clarification) if at position 12345 a diploid organism has a GT=1/1 ... this means both chromosomes are variant in this position...making their AC=2, and AN=2.

...AF should be 1.0 here, and both AC and AN should be 2.

Exceptions to this occur where multiple variants are called at the same position, which is mostly seen on indel calls. These are 'multi-allelic' sites and will have AN>2; however, AC will still never exceed AN in these situations, thus, AF will neither exceed 1. Exception obviously also occur in cancer where a tumour may have duplicated some chromosomes up to 70 or more times.

Does that help?

ADD REPLYlink modified 17 months ago • written 17 months ago by Kevin Blighe41k
1

It does help! Thank you!

ADD REPLYlink written 17 months ago by rightmirem60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 925 users visited in the last hour