Working with VEP in anaconda
2
0
Entering edit mode
2.5 years ago
luzglongoria ▴ 40

Hi there,

I'm working on a project for determining how this two protozoan organisms have evolved and in which genes they differ the most by looking at the SPNs.

I have done some previous analysis calling for variants by using some programs

# construct the FM-index for the reference genome (in this case the Parasite genome)
bwa index ref.fa

# Find the SA coordinates of the input reads

bwa aln ref.fa short_read.fq > aln_sa.sai

# Generate alignments in the SAM format given paired-end reads

# Transform .sam files into .bam files
samtools view -b -o output.bam input.sam

# Sort that bam file
samtools sort -@ 15 -o outfile infile

# Remove duplicates
java -Xmx2g -jar picard.jar MarkDuplicates MAX_FILE_HANDLES=500 REMOVE_DUPLICATES=true I= sorted.bam O=marked_duplicates.bam
M=marked_dup_metrics.txt

# Create .bai file for each .bam file
bamtools index -in file.bam

# Run Freebayes (looking for SNPs)
freebayes -f refernce.fasta file1.bam file2.bam file3.bam > combined.freebayes.vcf

# Compress and index the vcf files
bgzip -c combined.freebayes.vcf > combined.freebayes.vcf.gz
tabix -p  vcf combined.freebayes.vcf.gz


The problem now is that I ouwld like to use this .vcf file for determining the genes where there are more SPNs, the function of these genes and (if it is posible) determine how these two related organisms have evolved.

I know some programs like ANNOVAR http://annovar.openbioinformatics.org/en/latest/ or VEP https://www.ensembl.org/info/docs/tools/vep/index.html can be used in order to determine the effects of the variants (SNPs). Since I'm working on a conda environment, the problems I’m having are:

1) ANNOVAR cannot be installed in conda (or, at least, I haven’t found any command for installing it)

2) VEP can be installed in conda https://anaconda.org/bioconda/ensembl-vep but when installing it it says that the library that you need in order to do the analysis are not included so you need to run another command (vep_install) that it gives me errors each time I try to run it in my environment of conda.

vep_install -a cf -s plasmodium_relictum -y PRELSG -c /PATH/miniconda3/pkgs/ensembl-vep-98.3-pl526hecc5488_0/share/ensembl-vep-98.3-0/vep_install


and then

 - getting list of available cache files
ERROR: No matching species found at /PATH/miniconda3/envs/salmon/bin/vep_install line 1121, <> line 1.


What I am doing wrong? I just want to have the library in oder to use VEP in a normal way...

SNP anaconda conda vep RNA-Seq • 3.6k views
4
Entering edit mode
2.5 years ago
Emily 23k

In order to install non-vertebrate caches you have to do something slightly different. There's currently a bug which means the installer script can't install them, we're fixing it, so you need to do it manually. There's a cache file here and instructions for manually installing caches here.

0
Entering edit mode

Thank you so much. Following the instructions I have upload the cache file to where ./vep is installed and then run the command

tar xzf plasmodium_relictum_gca_900005765_vep_45_PRELSG.tar.gz


after that I have tried to run the command for VEP as usual but I get the error :

vep -i combined.11samples.subset.freebayes.vcf -o GRW4_VEP_results.txt --species plasmodium_relictum --database --genomes

-------------------- WARNING ----------------------
MSG: plasmodium_relictum is not a valid species name (check DB and API version)
FILE: Bio/EnsEMBL/Registry.pm LINE: 1334
CALLED BY: Bio/EnsEMBL/Registry.pm  LINE: 1109
Date (localtime)    = Thu Dec  5 11:13:54 2019
Ensembl API version = 95
---------------------------------------------------

-------------------- EXCEPTION --------------------
MSG: Can not find internal name for species 'plasmodium_relictum'
STACK Bio::EnsEMBL::VEP::BaseVEP::get_database_assembly /home/luz_garcia_longoria/miniconda3/envs/salmon/share/ensembl-vep-95.0-1/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:419
STACK Bio::EnsEMBL::VEP::BaseRunner::setup_db_connection /home/luz_garcia_longoria/miniconda3/envs/salmon/share/ensembl-vep-95.0-1/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::init /home/luz_garcia_longoria/miniconda3/envs/salmon/share/ensembl-vep-95.0-1/modules/Bio/EnsEMBL/VEP/Runner.pm:118
STACK Bio::EnsEMBL::VEP::Runner::run /home/luz_garcia_longoria/miniconda3/envs/salmon/share/ensembl-vep-95.0-1/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel /home/luz_garcia_longoria/miniconda3/envs/salmon/bin/vep:225
Date (localtime)    = Thu Dec  5 11:13:54 2019
Ensembl API version = 95


What I am doing wrong?

1
Entering edit mode

You're using version 95 of the VEP. Plasmodium relictum didn't come into Ensembl until release 96, which means the VEP can't find the species name in its list of aliases. Please update to the latest VEP, which is 98.

0
Entering edit mode

Thanks. I have been able to run the VEP 98 version but still I have the same problem:

/PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/vep -i combined.11samples.subset.freebayes.vcf  -o GRW4_VEP_results.txt --species plasmodium_relictum --database --genomes

Possible precedence issue with control flow operator at /PATH/miniconda3/envs/salmon/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 845.

-------------------- WARNING ----------------------
MSG: plasmodium_relictum is not a valid species name (check DB and API version)
FILE: Bio/EnsEMBL/Registry.pm LINE: 1334
CALLED BY: Bio/EnsEMBL/Registry.pm  LINE: 1109
Date (localtime)    = Thu Dec  5 13:17:27 2019
Ensembl API version = 98
---------------------------------------------------

-------------------- EXCEPTION --------------------
MSG: Can not find internal name for species 'plasmodium_relictum'
STACK Bio::EnsEMBL::VEP::BaseVEP::get_database_assembly /PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:419
STACK Bio::EnsEMBL::VEP::BaseRunner::setup_db_connection /PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/modules/Bio/EnsEMBL/VEP/BaseRunner.pm:123
STACK Bio::EnsEMBL::VEP::Runner::init /PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/modules/Bio/EnsEMBL/VEP/Runner.pm:118
STACK Bio::EnsEMBL::VEP::Runner::run /PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/modules/Bio/EnsEMBL/VEP/Runner.pm:194
STACK toplevel /PATH/miniconda3/envs/salmon/ensembl-vep-98.3-0/vep:224
Date (localtime)    = Thu Dec  5 13:17:27 2019
Ensembl API version = 98

0
Entering edit mode

needs to be --genomes at the end, not --genome

0
Entering edit mode

I doesn't work with --genomes it gives the same error

1
Entering edit mode

Sorry, I forgot there is an extra option for some of the bacteria and protist databases. Please add is_multispecies 1 to your command. Because the database contains lots of strains, you also need to give the more specific genome name of --species plasmodium_relictum_gca_900005765. I know there is only one Plasmodium relictum genome in the database at the moment, but this allows us to add more in future without changing the aliases.