Question: Vcf-Consensus For Individuals From 1000 Genomes
5.4 years ago by
I would like to create a consensus sequence for an individual from 1000G, where the sequence incorporates the variants typed for this individual.

Using tabix and vcf-tools (vcf-subset), I extracted the relevant variant information from 1000G SNP call files.

At this point, I have variants per chromosome for individual HG00096.

I would like to incorporate these variants into the reference coding sequence.

I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.

An example with trimmed sequence looks like this:


Note: this is a typical FASTA sequence file whose header includes gene|transcript|protein|chr|chr_start|chr_end
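To make the header layout concrete, here is a minimal sketch that splits such a header into named fields. It assumes exactly six '|'-separated fields in the order given in the note; the names `CdsHeader` and `parse_header` and the placeholder identifiers and coordinates are hypothetical and do not come from a real Biomart record.

```python
from typing import NamedTuple


class CdsHeader(NamedTuple):
    gene: str
    transcript: str
    protein: str
    chrom: str
    start: int  # chr_start as written in the header
    end: int    # chr_end as written in the header


def parse_header(line: str) -> CdsHeader:
    """Split a '>gene|transcript|protein|chr|chr_start|chr_end' header line."""
    fields = line.strip().lstrip(">").split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 '|'-separated fields, got {len(fields)}")
    gene, transcript, protein, chrom, start, end = fields
    return CdsHeader(gene, transcript, protein, chrom, int(start), int(end))


# Placeholder identifiers and coordinates, not a real record.
example = parse_header(">GENE_ID|TRANSCRIPT_ID|PROTEIN_ID|1|1000|2000")
print(example.chrom, example.start, example.end)
```

With the chromosome and coordinates available as fields, each sequence can at least be matched to the per-chromosome variant file it belongs to.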

My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file:

  • FastaAlternateReferenceMaker (GATK)
  • vcf-consensus (vcftools)
  • mpileup (samtools)

I would like to use vcf-consensus for my task.

vcftools describes the use of vcf-consensus as follows:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

I presume that the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome. In that case, my protein-coding sequence above would not work as an acceptable reference sequence. Nevertheless, I am only interested in the protein-coding part.
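The underlying issue is one of coordinates: variant positions in the VCF are chromosome positions, so a subsequence can only be edited if its chromosome start is known. A minimal sketch of that offset arithmetic follows. It is not how vcf-consensus works internally; it assumes one contiguous, forward-strand region, single-base substitutions only, and that every listed variant is carried by the individual. The function name `apply_snps` and the toy data are hypothetical.

```python
from typing import Iterable, Tuple

Snp = Tuple[int, str, str]  # (1-based chromosome position, REF base, ALT base)


def apply_snps(seq: str, region_start: int, snps: Iterable[Snp]) -> str:
    """Return seq with ALT bases substituted at positions inside the region.

    seq is the reference sequence of one contiguous, forward-strand region
    whose first base sits at chromosome position region_start (1-based).
    """
    bases = list(seq)
    for pos, ref, alt in snps:
        offset = pos - region_start
        if not 0 <= offset < len(bases):
            continue  # variant lies outside this region
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only single-base substitutions are handled: {pos}")
        if bases[offset].upper() != ref.upper():
            raise ValueError(
                f"REF mismatch at {pos}: sequence has {bases[offset]!r}, VCF has {ref!r}"
            )
        bases[offset] = alt
    return "".join(bases)


# Toy region starting at position 100; the variants are made up for illustration.
print(apply_snps("ACGTACGT", 100, [(101, "C", "T"), (107, "T", "G"), (500, "A", "C")]))
# prints ATGTACGG
```

The REF check is the useful part: if the sequence and the coordinates do not line up, it fails loudly instead of producing a silently wrong consensus. A coding sequence that is spliced from several exons or lies on the reverse strand would need per-exon coordinates and reverse-complementing, which this sketch does not attempt.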

How can I modify vcf-consensus or my input sequences to create a consensus coding sequence for HG00096, given the reference coding sequence and this individual's variants? Do I have to apply vcf-consensus to the entire gene sequence first and then extract the relevant parts, which would be very tedious?

Thank you very much.

cross posted on SE:

written 5.4 years ago by Pierre Lindenbaum
2.1 years ago by
Hi, I am facing the same problem here.

Can anybody help me figure it out?

written 2.1 years ago by rajesh.msch
