Question

problem in running vcf-consensus command

7.2 years ago

rajesh.msch • 0

Hi,

I would like to create a consensus sequence for an individual from 1000G, where the sequence incorporates the variants typed for this individual.

Using tabix and vcftools (vcf-subset), I extracted the relevant variant information from the 1000G SNP call files.

At this point, I have variants per chromosome for individual HG00096.

I would like to incorporate these variants into the reference coding sequence.

I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.

An example with trimmed sequence looks like this:

>ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167
ATGCTCTTTGAGGGCTTGGATCTGGTGTCGGCGCTGGCCACCCTCGCCGCGTGCCTGGTG
TCCGTGACGCTGCTGCTGGCCGTGTCGCAGCAGCTGTGGCAGCTGCGCTGGGCCGCCACT
CGCGACAAGAGCTGCAAGCTGCCCATCCCCAAGGGATCCATGGGCTTCCCGCTCATCGGA

My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file:

FastaAlternateReferenceMaker (GATK), vcf-consensus (vcftools), and mpileup.

I would like to use vcf-consensus for my task.

vcftools describes the use of vcf-consensus as follows:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

I presume that the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome. In that case, my protein-coding sequence above would not work as an acceptable reference sequence. Nevertheless, I am only interested in the protein-coding part.

How can I modify vcf-consensus or my input sequences to create a consensus coding sequence for HG00096, given the reference coding sequence and his variants? Do I have to apply vcf-consensus to entire gene sequence first and then extract the relevant parts - which would be very tedious?

Thank you very much.

RNA-Seq • 2.3k views

updated 7.2 years ago by Malcolm.Cook ★ 1.5k • written 7.2 years ago by rajesh.msch • 0

score 0 · Answer 1 · 2017-02-23

7.2 years ago

WouterDeCoster 47k

My strategy would be to first modify the full genome using GATK FastaAlternateReferenceMaker and then use bedtools getfasta to slice out the genes you want.
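
The edit-then-slice idea can be illustrated on a toy sequence. The sketch below is not what GATK or bedtools do internally; `apply_snps` and `slice_intervals` are hypothetical helper names, and the sequence, SNPs, and exon intervals are made up. It only assumes VCF-style 1-based SNP positions and BED-style 0-based, half-open intervals.

```python
# Toy illustration only: real genomes should go through the tools named above.

def apply_snps(reference, snps):
    """Return a copy of `reference` with single-base substitutions applied.

    `snps` is a list of (pos, ref, alt) tuples with 1-based positions (VCF style).
    """
    seq = list(reference)
    for pos, ref, alt in snps:
        if seq[pos - 1] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq[pos - 1] = alt
    return "".join(seq)


def slice_intervals(sequence, intervals):
    """Concatenate 0-based, half-open (start, end) intervals (BED style)."""
    return "".join(sequence[start:end] for start, end in intervals)


reference = "AAACCCGGGTTTACGT"           # made-up "chromosome"
snps = [(5, "C", "T"), (11, "T", "G")]   # made-up SNPs for one individual
exons = [(3, 6), (9, 12)]                # made-up coding intervals

edited = apply_snps(reference, snps)     # step 1: edit the whole sequence
coding = slice_intervals(edited, exons)  # step 2: slice out the coding parts

print(edited)
print(coding)
```

Because SNPs do not change the sequence length, the original exon coordinates still apply after editing in this SNP-only case.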

No, but I am looking to use a vcftools utility such as vcf-consensus. Could you help me?

7.2 years ago by rajesh.msch • 0

score 0 · Answer 2 · 2017-02-23

You are correct that "...the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome."

The answer to your question "Do I have to apply vcf-consensus to entire gene sequence first and then extract the relevant parts - which would be very tedious?" is almost "Yes", with one correction: it should be "entire genome sequence" rather than "entire gene sequence".

So you now need a strategy to "extract the relevant parts".

If you have a gff3 file describing your gene models, I would recommend using gffread, part of the cufflinks suite, which can extract the fasta sequence underlying your gene model coordinates in the gff3.

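Except, your old gff3 may no longer be any good once the genome has been edited: any insertions or deletions shift the coordinates downstream of them (SNP-only edits leave the coordinates unchanged).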
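
To see why the old coordinates go stale, here is a small sketch on a made-up sequence with a single made-up insertion. All values are illustrative assumptions; the variant is written VCF-style (1-based POS, with the anchor base included in REF and ALT), and the exon uses 0-based, half-open coordinates.

```python
reference = "AAACCCGGGTTT"   # made-up reference
pos, ref, alt = 2, "A", "ATT"  # made-up insertion of "TT" after position 2

# Apply the single variant by replacing the REF allele with the ALT allele.
edited = reference[:pos - 1] + alt + reference[pos - 1 + len(ref):]

exon = (3, 6)  # 0-based, half-open interval on the ORIGINAL reference
original_exon = reference[exon[0]:exon[1]]

# Reusing the old coordinates on the edited sequence picks up the wrong bases.
stale = edited[exon[0]:exon[1]]

# Shifting by the length change of the upstream variant recovers the exon.
shift = len(alt) - len(ref)
shifted = edited[exon[0] + shift:exon[1] + shift]

print(original_exon, stale, shifted)
```

The annotation therefore has to be moved by the same offsets as the sequence, which is what the steps below are about.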

Q: What to do?

A: Apply the same edits you applied to your genome to the gff3!

Q: How?

A: use liftOver

Q: but liftOver requires a .chain file and I only have a .vcf file! How to convert?

A: use vcf2chain - part of g2gtools.

Q: Oy!

A: I know.
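
For intuition, the coordinate bookkeeping that a chain file encodes can be sketched in a few lines. This is a toy model under stated assumptions, not the algorithm used by liftOver or g2gtools: `lift_position` and `lift_features` are hypothetical names, the variants and features are made up, coordinates are 1-based and inclusive as in gff3, and features that end inside a changed allele are simply rejected.

```python
def lift_position(pos, variants):
    """Map a 1-based reference position onto the edited sequence.

    `variants` is a list of VCF-style (pos, ref, alt) tuples. Every variant
    whose REF allele ends before `pos` shifts it by len(alt) - len(ref).
    """
    offset = 0
    for vpos, ref, alt in sorted(variants):
        ref_end = vpos + len(ref) - 1
        if ref_end < pos:
            offset += len(alt) - len(ref)
        elif vpos < pos:
            # pos falls inside the REF allele (past the anchor base):
            # this toy model does not try to resolve that case.
            raise ValueError(f"position {pos} lies inside the variant at {vpos}")
    return pos + offset


def lift_features(features, variants):
    """Lift (name, start, end) features through the same variants."""
    return [
        (name, lift_position(start, variants), lift_position(end, variants))
        for name, start, end in features
    ]


variants = [(20, "A", "ATT"), (40, "CAG", "C")]  # made-up insertion and deletion
features = [("exon1", 5, 15), ("exon2", 25, 35), ("exon3", 50, 60)]

lifted = lift_features(features, variants)
for row in lifted:
    print(row)
```

In this example exon1 lies upstream of both variants and is unchanged, exon2 moves by +2 after the insertion, and exon3 ends up back at its original coordinates because the deletion cancels the insertion.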