Some questions about writing human mitochondrial variants to a VCF file
2.8 years ago
MatthewP ▴ 910

Hello, I have variant results from mtDNA sequencing. They look like this:

SampleID        Pos     Ref     Variant Major/Minor     Variant-Level   Coverage-FWD    Coverage-Rev    Coverage-Total
R07058.bam      9090    T       C       C/A     0.9974  2100    2136    4236


This result comes from mtDNA-Server. Major is the most frequent nucleotide at a site, and Minor is the less frequent one. Variant-Level seems to be the fraction of reads supporting the variant, but I am not sure about that.

I want to annotate these variants with snpEff, which needs a VCF file as input, so I am writing a Python script to convert this output to VCF. I read the VCF format specification before I started.

Since the mitochondrial genome is haploid, I write each variant at the same site as a separate record in the VCF. In this example that gives 2 lines:

#CHROM  POS     ID      REF     ALT ...
MT      9090    .       T       A ...
MT      9090    .       T       C ...


I hope this solution is right.

My question is about the INFO column in VCF. mtDNA is haploid, but a cell may contain many (an unknown number of) copies, so I don't know how to fill in these INFO tags:

1. AC : allele count in genotypes, for each ALT allele, in the same order as listed.
2. AN : total number of alleles in called genotypes.
3. DP : combined depth across samples, e.g. DP=154. I understand there will be several depth values, because more than one sample can be put on one line in a VCF, but I don't know what the combined depth across them is or how to calculate it.
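
To make the three definitions above concrete, here is a minimal Python sketch. It assumes hypothetical per-sample haploid genotype calls (allele index 0 = REF, 1..n = ALT alleles) and per-sample read depths; the sample values and the function name `info_counts` are made up for illustration. It only restates the tag definitions arithmetically and does not settle how heteroplasmic mtDNA should be encoded.

```python
def info_counts(samples, n_alt):
    """Compute AC, AN and DP for one VCF record.

    samples: list of dicts with hypothetical keys
        "gt": list of called allele indices (None = uncalled)
        "dp": read depth of that sample at the site
    n_alt: number of ALT alleles on the record
    """
    called = [a for s in samples for a in s["gt"] if a is not None]
    an = len(called)                                        # AN: called alleles in total
    ac = [called.count(i) for i in range(1, n_alt + 1)]     # AC: one count per ALT allele
    dp = sum(s["dp"] for s in samples)                      # DP: depth summed over samples
    return ac, an, dp


# Three hypothetical samples, each with one haploid call at the site.
samples = [
    {"gt": [1], "dp": 4236},
    {"gt": [0], "dp": 3000},
    {"gt": [2], "dp": 1500},
]
ac, an, dp = info_counts(samples, n_alt=2)
print("AC={};AN={};DP={}".format(",".join(map(str, ac)), an, dp))
```

Summing the per-sample depths is one common reading of "combined depth"; a specific variant caller may filter reads before counting.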

Any help is appreciated.

VCF mtDNA • 1.6k views

Hello MatthewP,

could you please describe what the columns Major/Minor and Variant-Level are for? Why do you need a vcf file?

Also it is better to use the code button in the formatting bar to show file contents. I've done it for you this time.

fin swimmer


Thanks for your advice. I have edited the question and explained what Major/Minor means.


Hello MatthewP,

thank you for adding information to your question. But I still don't understand what is meant by Major/Minor, because the Variant column contains only a C.

It is also necessary to understand why you need a VCF file. In the simplest case your VCF file just needs values in the CHROM, POS, REF and ALT columns. All other mandatory fields can be filled with . if this information isn't needed for downstream analyses.
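
The minimal file described above could be written with a short Python sketch like the following. The function names are hypothetical, the header is limited to the two lines every VCF needs, and ID, QUAL, FILTER and INFO are all set to the missing value `.`. Sorting by position alone assumes that all records are on a single contig.

```python
def minimal_vcf_line(chrom, pos, ref, alt):
    # ID, QUAL, FILTER and INFO are all set to the missing value "."
    return "\t".join([chrom, str(pos), ".", ref, alt, ".", ".", "."])


def minimal_vcf(records):
    """records: iterable of (chrom, pos, ref, alt) tuples on one contig."""
    header = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    # Sort by position only (single-contig assumption).
    body = [minimal_vcf_line(*r) for r in sorted(records, key=lambda r: r[1])]
    return "\n".join(header + body) + "\n"


print(minimal_vcf([("MT", 9090, "T", "C")]), end="")
```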

fin swimmer


Major/Minor is a column included in the results generated by mtDNA-Server. It creates two profiles based on variant allele frequencies, major and minor, and this information is used to perform haplogroup checks for each heteroplasmic site.


Thank you Nandini! Can I ask where you got all this information about mtDNA-Server? There is no detailed documentation in the GitHub project, so I have had to guess what all those columns mean.


Hi Matthew, I used mtDNA-Server before setting up my own pipeline for our lab. Have you read the paper for the tool? It should be explained there.


Yep, I read the paper before I downloaded the tool. I also want to set up my own pipeline, but I don't know how, as I am just a beginner in bioinformatics. How do you do the variant calling? Do you have any guidance for me on building an mtDNA pipeline?


Sure, I can help you with that, but it would be useful to know the aim of your project. What samples are you analysing? Why do you need to convert the results into VCF format? Do you only need to call variants, or do you need to perform further downstream analysis?


Thank you! Can I have your e-mail address? I will e-mail you to discuss this.


Please don't ask for email addresses. We like to keep the discussion open and on the forum so it benefits everyone.


Well, I work at a company offering sequencing services. This is the first time our company has received an mtDNA order. Our client wants us to analyse mtDNA heterogeneity (variants) and copy number variants (CNV). They use multiplex PCR to build the mtDNA library, so I think we can't get CNV from such data, because there is no nuclear genome to normalize between samples. I want to give them a very good variant report. Here are my pipeline requirements:

1. QC and mapping. I currently use bwa for mapping, but I am unsure which reference to use. I currently use rCRS, as recommended in the post "rCRS vs. RSRS vs. HG19 (Yoruba)". Is rCRS the same as chrMT of GRCh38 or hg38? (If I use the whole human genome as the reference, some reads map to other chromosomes, especially chr2.)
2. Variant calling. I have no idea how to do this.
3. Annotation. snpEff seems good to me. Any other suggestions?
4. If possible, I want to provide some biological or medical interpretation of these variants, for example whether some SNPs may cause disease. I am looking for useful databases on MITOMAP. I have never done this kind of work before, so maybe I need some tools besides those databases?

Details of the sequencing method: the library was constructed with the MultipSeqTm AImumiCap Panel, which uses 129 primer pairs to amplify the whole mtDNA by PCR.


There are several publications and automated pipelines that do this for you, but as you work for a company, you need to check whether this software is freely available for you to use.

So my pipeline for mtDNA analysis is as follows

1. Mapping: BWA with rCRS (hg19)

2. Mark duplicates with Picard

3. Variant calling: samtools and varscan

4. Variant annotation: annovar

5. Additional annotation: Mitomap

Hope this helps. Good luck


We like to set up our own pipelines so they are easy to maintain and upgrade. Thank you very much, I will try your pipeline.


Okay. But definitely do some research before implementing the pipeline, as some of the tools may not suit your requirements.


OK, I need to annotate those variants using snpEff, which takes a VCF file as input.

2.8 years ago

I hope this solution is right.

No, a VCF should normally contain only one record per CHROM/POS/REF, with all ALT alleles listed together on that line. See the VCF spec: for any attribute associated with the ALT alleles (e.g. AF, Number='A'), the number of values must equal the number of ALT alleles. Example:

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
#CHROM  POS     ID      REF     ALT ... INFO
MT      9090    .       T       A,C ... AN=100;AC=1,50;AF=0.01,0.5
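
A rough Python sketch of building such a multi-allelic record from per-variant rows follows. The dictionary keys (`pos`, `ref`, `alt`, `level`, `depth`) and the function name are hypothetical, and mapping Variant-Level to AF and Coverage-Total to DP is an assumption rather than something the mtDNA-Server output or the VCF spec prescribes. The AF values are taken from the example above, and the depth from the row in the question.

```python
def rows_to_records(rows, chrom="MT"):
    """Group per-variant rows into one VCF body line per (pos, ref)."""
    grouped = {}
    for row in rows:
        key = (row["pos"], row["ref"])
        rec = grouped.setdefault(key, {"alts": [], "afs": [], "depth": row["depth"]})
        rec["alts"].append(row["alt"])      # ALT alleles in input order
        rec["afs"].append(row["level"])     # one AF value per ALT, same order
    lines = []
    for (pos, ref), rec in sorted(grouped.items()):
        info = "DP={};AF={}".format(rec["depth"], ",".join(str(a) for a in rec["afs"]))
        lines.append("\t".join(
            [chrom, str(pos), ".", ref, ",".join(rec["alts"]), ".", ".", info]
        ))
    return lines


rows = [
    {"pos": 9090, "ref": "T", "alt": "A", "level": 0.01, "depth": 4236},
    {"pos": 9090, "ref": "T", "alt": "C", "level": 0.5, "depth": 4236},
]
for line in rows_to_records(rows):
    print(line)
```

A real file would also need `##INFO` header lines declaring DP and AF, like the AF declaration shown above.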


Thanks, I will check the VCF specification again! However, I still don't know how to decide the AC and AN values, because I don't know the copy number of mtDNA. If one of the variants is a deletion, should it also go on the same line as the SNP? Like:

#CHROM  POS     ID      REF     ALT ...
MT      9090    .       AT      A,AC ...


Am I understanding this right?


There are vcf validators. Try one of them.
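
Alongside a validator, the shared-REF representation asked about above can be checked by hand with a small Python sketch. It assumes the caller supplies the reference bases covering all the variants (`ref_span`, starting at `start`); the function name is hypothetical, and the bases are taken from the example in this thread, not checked against rCRS.

```python
def merge_to_shared_ref(start, ref_span, variants):
    """Rewrite overlapping variants against one common REF string.

    start: 1-based position of the first base of ref_span
    ref_span: reference bases covering every variant
    variants: list of (pos, ref, alt) tuples lying within ref_span
    Returns (shared_ref, list_of_alts).
    """
    alts = []
    for pos, ref, alt in variants:
        offset = pos - start
        if ref_span[offset:offset + len(ref)] != ref:
            raise ValueError("REF of variant at {} does not match ref_span".format(pos))
        # Keep the untouched flanking bases, swap in the ALT allele.
        alts.append(ref_span[:offset] + alt + ref_span[offset + len(ref):])
    return ref_span, alts


# Deletion of T after 9090 plus a T>C SNP at 9091, on a shared REF of "AT".
ref, alts = merge_to_shared_ref(9090, "AT", [(9090, "AT", "A"), (9091, "T", "C")])
print(ref, ",".join(alts))
```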