Extremely inexperienced biochemical scientist needing help manipulating a VCF file for Alissa pipeline development
3.8 years ago

Hello experts,

I would like to preface this post by saying that I'm a biochemical geneticist by trade and primarily work with biochemical pathways and biomarker detection. I have zero coding experience and my limited molecular genetic experience is exclusively in variant curation.

I have a keen interest in mitochondrial disorders, so when the opportunity arose to work on a project to help establish my facility as a center for WGS analysis, I jumped on it (without really thinking!). I have a VCF file from NextGENe V2.4.1 (I'm currently working with this in Excel) that does not contain mutations in the mitochondrial genome. I would like to manipulate the VCF file to include a dozen or so common mitochondrial mutations to be annotated using Alissa 5.3.

Questions:

  1. Are there any resources available (I've found this: http://samtools.github.io/hts-specs/VCFv4.2.pdf) that can tell me what I'm looking at in the current VCF file and what it means? My biochemical brain only sees letters and numbers... What data is critical to have to feed into Alissa?
  2. Is Alissa the best platform to be using for mitochondrial genomes? If not, what are other suggestions?

I'm feeling very out of my depth here.

Thank you!

genome next-gen VCF annotation sequencing • 733 views

The VCF specification document is the go-to resource to understand VCF files. I can try and simplify it a little:

  1. All lines that begin with ## are header lines with meta-information. These lines describe the information contained in the VCF file. If you imagine a table with headers, this part would describe what the column names actually mean.
  2. The line that begins with #CHROM is kind of like the table header. There are 8 fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. Column 9, called FORMAT, describes the layout of the values in columns 10 onward. From column 10 on, there is one column per sample described in the VCF.
    • CHROM, POS, REF and ALT are pretty self-explanatory. The change you're looking at is at the position POS in chromosome CHROM, where the reference base(s) REF has been altered to base(s) ALT.
    • ID comes from a database of existing mutations. It most commonly contains dbSNP's rs identifiers or COSMIC's identifiers or . if no existing variant matched to that location.
    • QUAL is a Phred-scaled quality score for the ALT call: the higher it is, the more confident the caller is that the variant is real. I don't use it much, so see the spec doc for the details.
    • FILTER is a flag listing all the filters that the entry failed. For a list of filters used, look at the ##FILTER lines in the meta-information
    • INFO is sort of like a list of things about the mutation. There can be a ton of information here, usually stored as key=value pairs (key1=value1;key2=value2;...). Check out the ##INFO lines for description on what each of the keys mean.
    • FORMAT describes the format of information in columns 10 on. Think of all columns starting at FORMAT as a table - this column would be the header of that table. If this column contains 5 values separated by :, every subsequent column will have 5 values separated by :, and the i-th value in this column will be the header for the i-th value in subsequent columns.
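
To make the column layout above concrete, here is a minimal Python sketch that splits a tiny VCF into its meta-information lines, the #CHROM header, and one data line. The example lines are made up for illustration (the sample name SAMPLE1, the depth and the quality values are hypothetical, and the contig may be called chrM or MT depending on your reference), and the helper names parse_info and read_vcf are my own, not part of any library. For real work a dedicated VCF library is a safer choice.

```python
# Names of the 8 fixed VCF columns, in order.
FIXED = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info):
    """Turn 'key1=value1;key2=value2;FLAG' into a dict ('.' means empty)."""
    parsed = {}
    if info == ".":
        return parsed
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without '=' are flags; store them as True.
        parsed[key] = value if sep else True
    return parsed


def read_vcf(lines):
    """Split VCF lines into meta lines, sample names and parsed records."""
    meta, sample_names, records = [], [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)                      # meta-information
        elif line.startswith("#CHROM"):
            sample_names = line.split("\t")[9:]    # columns 10 onward
        elif line:
            fields = line.split("\t")
            record = dict(zip(FIXED, fields[:8]))
            record["POS"] = int(record["POS"])
            record["INFO"] = parse_info(record["INFO"])
            # FORMAT (column 9) is the header for every sample column.
            format_keys = fields[8].split(":") if len(fields) > 8 else []
            record["SAMPLES"] = {
                name: dict(zip(format_keys, column.split(":")))
                for name, column in zip(sample_names, fields[9:])
            }
            records.append(record)
    return meta, sample_names, records


# Hypothetical example content; the values are illustrative only.
example_lines = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1",
    "chrM\t3243\t.\tA\tG\t50\tPASS\tDP=100\tGT:DP\t0/1:100",
]

meta, sample_names, records = read_vcf(example_lines)
for record in records:
    print(record)
```

Note that the columns are separated by tabs, which is worth keeping in mind when editing a VCF in Excel: save it back as tab-delimited text and leave the ## lines untouched.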

The spec has examples that will help understand the format better. A quick summary would be:

  1. Columns 1-8 describe each location where any sample differs from the reference
  2. Columns 10 on describe how each sample differs from the reference

An important FORMAT field is the GT field, which gives us the genotype of the change. Here, 0 is the REF allele, and other numbers are ALT alleles in the order listed in the ALT field. So, for a diploid organism, 0/0 is hom-ref; 0/1 is heterozygous and 1/1 is homozygous mutant.
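
As a rough illustration of the GT numbering described above, the following Python sketch translates a GT string back into bases and a zygosity label. The function name describe_genotype and the labels it returns are my own choices for this example, not something defined by the VCF spec or by Alissa.

```python
import re


def describe_genotype(gt, ref, alt):
    """Map a GT string such as '0/1' to a zygosity label and its bases.

    Index 0 is the REF allele; 1, 2, ... are the ALT alleles in the
    order they appear in the comma-separated ALT column.
    """
    alleles = [ref] + alt.split(",")
    # '/' separates unphased alleles, '|' separates phased ones.
    indices = re.split(r"[/|]", gt)
    if "." in indices:
        return ("missing", [])
    called = [alleles[int(i)] for i in indices]
    if all(i == "0" for i in indices):
        label = "hom-ref"
    elif len(set(indices)) == 1:
        label = "hom-alt"
    else:
        label = "het"
    return (label, called)


print(describe_genotype("0/0", "A", "G"))
print(describe_genotype("0/1", "A", "G"))
print(describe_genotype("1/1", "A", "G"))
print(describe_genotype("1/2", "A", "G,T"))
```

With two ALT alleles listed as G,T, a genotype of 1/2 means the sample carries one copy of G and one copy of T and no copy of the reference base.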

I hope I haven't confused you more than the spec doc.

3.7 years ago
JC 13k
  1. VCF is a simple table with complex content: the basic information is in the first 9 columns, and the remaining columns hold the genotypes and other encoded per-sample information. I have never used Alissa (I had to google it to learn that it is an Agilent product), so it's better to ask their customer support.

  2. Open-source alternatives are Gemini, Exomiser and VEP.
