I would like to create consensus sequence for an individual from 1000G where the sequence incorporates variants typed for this individual.
Using tabix and vcf-tools (vcf-subset), I extracted the relevant variant information from 1000G SNP call files.
At this point, I have variants per chromosome for individual HG000096.
I would like to incorporate these variants into reference coding sequence.
I downloaded all protein-coding sequences for the GRCh37 version from Ensembl Biomart.
An example with trimmed sequence looks like this:
>ENSG00000003137|ENST00000001146|ENSP00000001146|2|72356367|72375167 ATGCTCTTTGAGGGCTTGGATCTGGTGTCGGCGCTGGCCACCCTCGCCGCGTGCCTGGTG TCCGTGACGCTGCTGCTGGCCGTGTCGCAGCAGCTGTGGCAGCTGCGCTGGGCCGCCACT CGCGACAAGAGCTGCAAGCTGCCCATCCCCAAGGGATCCATGGGCTTCCCGCTCATCGGA
Note: typical fasta sequence file with the header including gene|transcript|protein|chr|chr_start|chr_end
My understanding is that there are a few alternatives out there to produce consensus sequences given a reference fasta sequence and variant call file:
- vcf consensus
I would like to use vcf-consensus for my task.
vcftools describes the use of vcf-consensus as follows:
cat ref.fa | vcf-consensus file.vcf.gz > out.fa
I presume that the reference file here refers to the reference DNA sequence including coding and non-coding parts of the genome. In that case, my protein-coding sequence above would not work as an acceptable reference sequence. Nevertheless, I am only interested in the protein-coding part.
How can I modify vcf-consensus or my input sequences to create consensus coding sequence for HG00096 given the reference coding sequence and his variants? Do I have to apply vcf-consensus to entire gene sequence first and then extract the relevant parts - which would be very tedious?
Thank you very much.