Question: problem in running vcf-consensus command
2.2 years ago, rajesh.msch wrote:

Hi,

I would like to create a consensus sequence for an individual from 1000G, where the sequence incorporates the variants typed for this individual.

Using tabix and vcf-tools (vcf-subset), I extracted the relevant variant information from 1000G SNP call files.

At this point, I have variants per chromosome for individual HG00096.

I would like to incorporate these variants into the reference coding sequence.

I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.

An example with trimmed sequence looks like this:

>ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167
ATGCTCTTTGAGGGCTTGGATCTGGTGTCGGCGCTGGCCACCCTCGCCGCGTGCCTGGTG
TCCGTGACGCTGCTGCTGGCCGTGTCGCAGCAGCTGTGGCAGCTGCGCTGGGCCGCCACT
CGCGACAAGAGCTGCAAGCTGCCCATCCCCAAGGGATCCATGGGCTTCCCGCTCATCGGA

Note: this is a typical FASTA file, with the header in the form gene|transcript|protein|chr|chr_start|chr_end.

My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file:

GATK FastaAlternateReferenceMaker, vcf-consensus, and mpileup. I would like to use vcf-consensus for my task.

vcftools describes the use of vcf-consensus as follows:

cat ref.fa | vcf-consensus file.vcf.gz > out.fa

I presume that the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome. In that case, my protein-coding sequence above would not work as an acceptable reference sequence. Nevertheless, I am only interested in the protein-coding part.

How can I modify vcf-consensus or my input sequences to create a consensus coding sequence for HG00096, given the reference coding sequence and his variants? Do I have to apply vcf-consensus to entire gene sequence first and then extract the relevant parts - which would be very tedious?

Thank you very much.

2.2 years ago, WouterDeCoster (Belgium) wrote:

My strategy would be to first modify the full genome using GATK FastaAlternateReferenceMaker and then use bedtools getfasta to slice out the genes you want.
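
bedtools getfasta needs the regions as a BED file. As a minimal sketch, the headers of the FASTA file shown in the question could be turned into BED intervals as below. This assumes the header fields are gene|transcript|protein|chr|chr_start|chr_end with 1-based inclusive coordinates (BED is 0-based, half-open), and the name headers_to_bed and all file names are made up for illustration. Note that these intervals cover the whole genomic span, introns included, so the result is not the spliced coding sequence.

```shell
# Hypothetical helper: read a FASTA file on stdin, print one BED line per header.
# Assumed header layout: >gene|transcript|protein|chr|chr_start|chr_end
headers_to_bed() {
  awk -F'|' '/^>/ {
    # columns: chromosome, start - 1 (BED is 0-based), end, transcript ID as name
    printf "%s\t%d\t%d\t%s\n", $4, $5 - 1, $6, $2
  }'
}

# Example with the header from the question:
printf '>ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167\nATGCTC\n' \
  | headers_to_bed
# prints (tab-separated): 2  72356366  72375167  ENST00000001146

# Possible use with bedtools (file names are placeholders):
#   headers_to_bed < coding_sequences.fa > genes.bed
#   bedtools getfasta -fi HG00096.genome.fa -bed genes.bed -name -fo HG00096.genes.fa
```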

2.2 years ago, rajesh.msch replied:

No, but I would like to use a vcftools utility such as vcf-consensus. Could you help me?

2.2 years ago, Malcolm.Cook (Kansas, USA) wrote:

You are correct that "...the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome."

The answer to your question "Do I have to apply vcf-consensus to entire gene sequence first and then extract the relevant parts - which would be very tedious?" is almost "Yes", except that it has to be the "entire genome sequence" rather than the "entire gene sequence".

So you now need a strategy to "extract the relevant parts".

If you have gff3 describing your gene models, I would recommend using gffread, part of the cufflinks suite, which can extract the fasta underlying your gene model coordinates in the gff3.

Except that your old gff3 is no longer any good: the genome has been edited, so any insertions or deletions shift the coordinates of everything downstream of them.

Q: What to do?

A: Apply the same edits you applied to your genome to the gff3!

Q: How?

A: use liftOver

Q: but liftOver requires a .chain file and I only have a .vcf file! How to convert?

A: use vcf2chain - part of g2gtools.

Q: Oy!

A: I know.
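
Put together, the steps above might look roughly like the following. This is only a sketch: the function name build_sample_cds and every file name are placeholders, the flags are written from memory of each tool's usage text and should be checked against its --help (newer g2gtools releases may not ship vcf2chain under that name), and it assumes the chain file describes the same edits that vcf-consensus applied.

```shell
# Hypothetical wrapper around the four steps; nothing runs until it is called.
build_sample_cds() {
  ref=$1      # reference genome FASTA, e.g. GRCh37.fa
  vcf=$2      # bgzipped, tabix-indexed VCF, e.g. HG00096.vcf.gz
  gff=$3      # gene models on the reference genome, e.g. genes.gff3
  sample=$4   # sample name as it appears in the VCF, e.g. HG00096

  # 1. Edit the whole genome with the sample's variants.
  vcf-consensus -s "$sample" "$vcf" < "$ref" > "$sample.fa" || return 1

  # 2. Describe the same edits as a chain file (assumed g2gtools syntax).
  g2gtools vcf2chain -f "$ref" -i "$vcf" -s "$sample" -o "$sample.chain" || return 1

  # 3. Move the gene models onto the edited genome.
  liftOver -gff "$gff" "$sample.chain" "$sample.gff3" "$sample.unmapped" || return 1

  # 4. Extract the spliced CDS of each transcript from the edited genome.
  gffread -x "$sample.cds.fa" -g "$sample.fa" "$sample.gff3"
}

# Possible call (all arguments are placeholders):
#   build_sample_cds GRCh37.fa HG00096.vcf.gz genes.gff3 HG00096
```

Features that liftOver cannot place end up in the unmapped file, so it is worth checking that file before trusting the extracted sequences.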
