How can I use my own genome file on Variant Effect Predictor (VEP)
2
0
Entering edit mode
15 months ago
caro-ca ▴ 20

Dear community, I am working with Variant Effect Predictor and I was wondering if I could use my own genome instead of the predefined genome as I could not find the option for that. Thank you in advance.

VEP ENSEMBL VARIANT EFFECT PREDICTOR • 919 views
2
Entering edit mode
15 months ago
Ben_Ensembl ★ 1.8k

No problem- Ensembl has never used the term sacCer2, this is a name given by the UCSC genome browser. However, from the link you provided, it seems that the sacCer2 assembly presented in UCSC is the Ensembl version 62. I suggest that you double check that the S. cerevisiae genome assembly presented in Ensembl 62 corresponds with your genome assembly of interest.

However, there are now pre-built VEP caches available for this data, so I suggest using the --custom option as described before with the FASTA and GTF files available for Ensembl 62 from the Ensembl FTP site: http://ftp.ensembl.org/pub/release-62/

0
Entering edit mode

Thank you so much for your help! I downloaded the GTF and FASTA files provided in your link. Consequently I followed the steps provided (here):

grep -v "#" Saccharomyces_cerevisiae.EF2.62.gtf | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > sacCer_EF2.gtf.gz tabix sacCer_EF2.gtf.gz  To then follow: ./vep [...] --custom Filename , Short_name , File_type , Annotation_type , Force_report_coordinates , VCF_fields /home/silviav/ensembl-vep/./vep --custom ~/ensembl-vep/sacCer2/sacCer_EF2.gtf.gz , , gtf , overlap , 1  In the previous code, I did not understand what I should include in the argument VCF_fields. However, I run the code by itself and I got this error message: -------------------- EXCEPTION -------------------- MSG: ERROR: No format specified for custom annotation source /home/silviav/ensembl-vep/sacCer2/sacCer_EF2.gtf.gz STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_custom /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:208 STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:93 STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175 STACK Bio::EnsEMBL::VEP::Runner::init /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123 STACK Bio::EnsEMBL::VEP::Runner::run /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194 STACK toplevel /home/silviav/ensembl-vep/./vep:227 Date (localtime) = Wed Aug 19 12:18:02 2020 Ensembl API version = 100  Once the previous code is fixed, my plan is to run: /home/silviav/ensembl-vep/./vep -i /hosts/samples_all_merged.vcf --cache --gtf sacCer_EF2.gtf.gz --fasta Saccharomyces_cerevisiae.EF2.62.dna.toplevel.fa.gz  I hope you could let me know if I am doing the procedure correctly and what it is missing. Thank you for your constant help. ADD REPLY 0 Entering edit mode No problem- the --custom options allow you to further configure the custom file and the VEP behaviour. Following the preparation of the GTF file, which you seem to have been able to completed successfully, you should be able to run the VEP query as you described. There is further advice in the documentation: https://www.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gff ADD REPLY 0 Entering edit mode I followed that link, but I still do not understand the argument VCF_fields. The text says "You can specify any info type (e.g. "AC")" so what kind of type is "AC"? I do not know what to add to that argument. Secondly, how could I fix the "No format specified for custom annotation source /home/silviav/ensembl-vep/sacCer2/sacCer_EF2.gtf.gz" error? because I cannot preceed with the next code that is /home/silviav/ensembl-vep/./vep -i /hosts/samples_all_merged.vcf --cache --gtf sacCer_EF2.gtf.gz --fasta Saccharomyces_cerevisiae.EF2.62.dna.toplevel.fa.gz Thank you in advance. ADD REPLY 0 Entering edit mode You do not need to run the command where you specify the --custom options. If you do, it should be included as a flag within your VEP query. You should be able to run your VEP query using --cache --gtf which is a shortcut to: --custom data.gtf.gz,,gtf You should use the --custom flag if you wish to customise the name of the GTF as it appears in the SOURCE field and VEP output header. The VCF_fields option in --custom allows you to specify particular fields if you are using a custom VCF file, so this does not apply to your example. ADD REPLY 0 Entering edit mode This is the stdout while running --cache --gtf: /home/silviav/ensembl-vep/./vep -i /hosts/samples_all_merged.vcf --cache --gtf sacCer_EF2.gtf.gz --fasta Saccharomyces_cerevisiae.EF2.62.dna.toplevel.fa.gz -------------------- EXCEPTION -------------------- MSG: ERROR: Cache directory /home/silviav/.vep/homo_sapiens not found STACK Bio::EnsEMBL::VEP::CacheDir::dir /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:311 STACK Bio::EnsEMBL::VEP::CacheDir::init /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:227 STACK Bio::EnsEMBL::VEP::CacheDir::new /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/CacheDir.pm:111 STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all_from_cache /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:115 STACK Bio::EnsEMBL::VEP::AnnotationSourceAdaptor::get_all /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/AnnotationSourceAdaptor.pm:91 STACK Bio::EnsEMBL::VEP::BaseRunner::get_all_AnnotationSources /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:175 STACK Bio::EnsEMBL::VEP::Runner::init /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:123 STACK Bio::EnsEMBL::VEP::Runner::run /home/silviav/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:194 STACK toplevel /home/silviav/ensembl-vep/./vep:227 Date (localtime) = Wed Aug 19 15:34:46 2020 Ensembl API version = 100 ---------------------------------------------------  ADD REPLY 0 Entering edit mode It seems that the VEP can't locate your cache. However, based on your query, I do not think you require the --cache option anyway. You should remove --cache from your query. ADD REPLY 0 Entering edit mode Thank you! I could run VEP with the custom genome, however, I have the feeling something is not complete as I only get intergenic variant as the major consequence of the insertions in my vcf file. This is the output obtained by running: vep -i samples_all_merged.vcf --gtf sacCer_EF2.gtf.gz --fasta saccharomyces_cerevisia_EF2.fa.gz  Where the GTF file was: grep -v "#" Saccharomyces_cerevisiae.EF2.62.gtf | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > sacCer_EF2.gtf.gz
tabix sacCer_EF2.gtf.gz


And the original FASTA file downloaded from the link above was uncompressed and compressed again with bgzip:

bgzip -c ~/ensembl-vep/sacCer2/Saccharomyces_cerevisiae.EF2.62.dna.toplevel.fa.gz > saccharomyces_cerevisia_EF2.fa.gz


Now, this is the output by using the Ensembl website. I tried to add parameters to VEP to get more consequences of the insertions as:

vep -i samples_all_merged.vcf --gtf sacCer_EF2.gtf.gz --fasta saccharomyces_cerevisia_EF2.fa.gz --regulatory --overlaps --gene_phenotype


But I did not get many differences from the one without parameters. Alternatively, I could use bedtools intersect between the gene file and the location of my insertions file as a sanity check, but first I wanted to check with you if I am missing something in the codes above.

Thank you very much for your help.

1
Entering edit mode

I'm glad you were able to get the VEP to run using the custom GTF and FASTA file. However, the --regulatory flag is only applicable to species with regulatory feature annotation and --gene_phenotype relies on underlying gene-phenotype associations.

However, I am unable to see your output file to comment on the 'intergenic' consequence prediction. Please contact the Ensembl Helpdesk (helpdesk[at]ensembl.org) with information of the query that you ran with the associated output file so that we can help to diagnose the issue.

Best wishes

Ben

1
Entering edit mode
15 months ago
Ben_Ensembl ★ 1.8k

Hi caro-ca,

To use a custom genome and annotation, you will need to use the VEP command-line tool, using the -custom flag and following documentation to prepare your custom cache: https://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html

Best wishes

Ben

0
Entering edit mode

Thank you. I will try that out.

0
Entering edit mode

Sorry if I did not make myself clear. I wanted to know if I could add my own --species data (as a FASTA file, for instance) instead of the predefined (saccharomyces_cerevisiae). Previously, this is the code I used:

/home/ensembl-vep/./vep -i /hosts/samples_all_merged.vcf -o /hosts/VEP_output_merged.vcf --species saccharomyces_cerevisiae --offline


Is it just enough to define the argument --fasta?

Thank you in advance.

0
Entering edit mode

No problem- so you want to analyse variants in a VCF file which were identified against your own sequence file, but using the gene annotation available in Ensembl for S. cerevisiae?

The VEP requires gene annotation (in the form of GTF or GFF annotation files) to predict the consequences of variants. Gene annotation is produced with respect to specific genome assemblies, sequences in FASTA format. To perform an accurate VEP analysis, you will need to have a gene annotation file that corresponds to your own S. cerevisiae sequence data.

0
Entering edit mode

Thank you for your clarification. My question was mainly because I saw VEP uses the assembly R64 or sacCer3 for Saccharomyces cerevisiae and I need the sacCer2. I found that there is the gene annotation for sacCer2 available on Ensembl , so how can I point out that the species should be Saccharomyces cerevisiae scaCer2 and not R64. If this is possible then does this mean that I do not need to upload the GFF file as Ensembl will look at its databases?