Generate peptide sequences from VCF
2.2 years ago
Ritu_K ▴ 20

I have multiple VCFs from single-cell RNA-seq data and I want to derive peptide sequences from them. From what I have read, annotating the VCF with Ensembl's VEP gives the amino acid/protein information. However, I need a FASTA file as output that can be used for downstream analysis. I also found that GATK's FastaAlternateReferenceMaker can produce a FASTA file from a VCF.

Can I use the VCF output by VEP as input to FastaAlternateReferenceMaker to get the required FASTA? I am not sure whether I should pass the entire VCF or just the protein information. Please help me get a better understanding of this procedure.

vcf vep protein sequence ensembl • 2.1k views
2.2 years ago
Ritu_K ▴ 20

I found out that this can be done in the following way:

  1. Run Ensembl's VEP with the VCF files as input to find the proteins affected by the variants. The output contains many fields, but the one of interest here is the Ensembl protein ID, which starts with ENSP. (VEP output format is described here: https://m.ensembl.org/info/docs/tools/vep/vep_formats.html#vcfout)
  2. Once we have the protein IDs, pass each ID to the Ensembl REST API to get the FASTA: https://rest.ensembl.org/sequence/id/ENSP00000404426?content-type=text/x-fasta;type=protein
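
The two steps above could be scripted roughly as follows. This is only a sketch, assuming VEP was run with `--vcf` and `--protein` so that the CSQ annotation carries an `ENSP` field; the function names are made up for illustration, and the REST call returns the reference protein sequence for each ID.

```python
import urllib.request

REST_SERVER = "https://rest.ensembl.org"


def csq_field_names(header_line):
    """Read the CSQ sub-field names from the VEP INFO header line."""
    # Header looks like: ##INFO=<ID=CSQ,...,Description="... Format: Allele|Consequence|...">
    fmt = header_line.split("Format: ", 1)[1]
    return fmt.rstrip().rstrip('">').split("|")


def extract_ensp_ids(vcf_lines):
    """Collect the unique ENSP IDs found in the CSQ annotations of a VEP VCF."""
    fields = None
    ids = set()
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##INFO=<ID=CSQ"):
            fields = csq_field_names(line)
        elif line.startswith("#") or not line:
            continue
        elif fields is not None:
            info = line.split("\t")[7]  # INFO is the 8th VCF column
            for entry in info.split(";"):
                if not entry.startswith("CSQ="):
                    continue
                # One comma-separated annotation per transcript/allele
                for annotation in entry[4:].split(","):
                    values = dict(zip(fields, annotation.split("|")))
                    ensp = values.get("ENSP", "")
                    if ensp:
                        ids.add(ensp)
    return sorted(ids)


def protein_fasta_url(ensp_id):
    """Build the REST URL from step 2 for a given protein ID."""
    return f"{REST_SERVER}/sequence/id/{ensp_id}?content-type=text/x-fasta;type=protein"


def fetch_protein_fasta(ensp_id):
    """Download the FASTA record for one protein ID (needs network access)."""
    with urllib.request.urlopen(protein_fasta_url(ensp_id)) as response:
        return response.read().decode()
```

For example, `extract_ensp_ids(open("annotated.vcf"))` (file name hypothetical) gives the list of IDs, and each one can then be passed to `fetch_protein_fasta` and written to a single FASTA file.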

With the ENSP ID you will only get the reference protein sequence though, correct?


Yes, correct.

2.2 years ago
Ben_Ensembl ★ 2.4k

Hi ritu k,

Not a direct answer to your question, but you may want to consider using the 'Haplosaurus' tool for predicting protein sequences: https://www.nature.com/articles/s41467-018-06542-1

Haplosaurus is a VEP-like tool that uses phased VCF files to predict protein haplotypes. This offers an advantage over VEP's analysis, which treats each input variant independently. Because Haplosaurus considers the combined change contributed by all the variant alleles across a transcript, it correctly accounts for any compound effects the variants may have, giving a more accurate representation of the protein changes that result from the reported genomic variants. You can find more information in our documentation: https://www.ensembl.org/info/docs/tools/vep/haplo/index.html

Haplosaurus does not provide protein FASTA sequences as an output, so you may wish to contact the GATK forum for further advice about using FastaAlternateReferenceMaker: https://gatk.broadinstitute.org/hc/en-us/articles/360037594571-FastaAlternateReferenceMaker
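
To illustrate what a compound effect looks like, here is a toy sketch (not how Haplosaurus is implemented): two SNVs on the same haplotype fall in one codon. The codon, the variants and the helper names are invented for this example; only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


def translate(codon):
    """Translate a single DNA codon to its one-letter amino acid."""
    return CODON_TABLE[codon]


reference = "AAA"   # hypothetical reference codon (Lys, K)
snv_a = (0, "G")    # hypothetical variant at codon position 1
snv_b = (1, "C")    # hypothetical variant at codon position 2

# Each variant considered on its own, as in per-variant annotation
independent = [translate(apply_snvs(reference, [snv])) for snv in (snv_a, snv_b)]
# Both variants applied together, as they occur on a phased haplotype
combined = translate(apply_snvs(reference, [snv_a, snv_b]))

print(translate(reference), independent, combined)  # K ['E', 'T'] A
```

Looked at one at a time, the variants suggest K>E and K>T, but the haplotype carrying both actually encodes alanine, which neither per-variant prediction reports.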


Thank you, Ben, for the information. I will definitely try out this tool.
