Question

Construction of the reference genome database (GCA_000001405.15_GRCh38) with snpeff

2.5 years ago

Adham • 0

Dear colleagues

I used the GRCh38 reference genome, version GCA_000001405.15_GRCh38 / seqs_for_alignment_pipelines.ucsc_ids, downloaded from https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ This version was used for alignment and variant calling. I now want to annotate the genetic variants with snpEff v5, but I did not find this version of the genome in the snpEff config file. The versions listed there are those from UCSC (http://hgdownload.cse.ucsc.edu) and NCBI (GRCh38.p13.RefSeq).

If anyone is familiar with snpEff, I would like to know how I can build a database for the version that was used for the variant calling (GCA_000001405.15_GRCh38 / seqs_for_alignment_pipelines.ucsc_ids). Unless I'm mistaken, I couldn't find the annotation files for this version, which is the one recommended for alignment.

FYI: I used the UCSC version for the annotation, but I got messages such as "WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS" and, as mentioned in the snpEff manual, the original coordinates of the VCF file are not exactly the same as the coordinates used to calculate the variant annotation.

Thank you,

reference snpeff genome alignement annotation • 1.1k views

Updated 2.5 years ago by vkkodali_ncbi ★ 3.7k • written 2.5 years ago by Adham • 0

score 0 · Answer 1 · 2021-11-01

2.5 years ago

vkkodali_ncbi ★ 3.7k

You can use the GRCh38.p13.RefSeq database provided by snpEff for this. Additional information about the snpEff human genome databases is in their documentation: https://pcingola.github.io/SnpEff/se_human_genomes/

Alternatively, you can build your own database starting from the GFF3 or GTF files located in the seqs_for_alignment_pipelines directory on the NCBI FTP site. Instructions on how to build a custom database are in the snpEff documentation: https://pcingola.github.io/SnpEff/se_buildingdb/
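As a rough illustration of what the custom build involves, here is a minimal Python sketch that only lays out the expected file locations, the config entry, and the build command; it does not run anything. The genome ID `GRCh38_no_alt`, the description string, and the `snpEff` install directory are placeholders I made up, and the helper `plan_snpeff_build` is hypothetical. The layout (`data/<genome_id>/genes.gtf`, `data/<genome_id>/sequences.fa`, a `<genome_id>.genome` line in `snpEff.config`, then `build -gtf22`) follows the building-databases page linked above, so check it against the documentation for your snpEff version.

```python
from pathlib import Path


def plan_snpeff_build(snpeff_dir, genome_id, description):
    """Describe the files, config entry and command for a custom snpEff build."""
    data_dir = Path(snpeff_dir) / "data" / genome_id
    return {
        # Copy the GTF from the seqs_for_alignment_pipelines directory here
        "gtf_path": data_dir / "genes.gtf",
        # Copy the same FASTA that was used for alignment here
        "fasta_path": data_dir / "sequences.fa",
        # Line to append to snpEff.config so the genome ID is recognised
        "config_line": f"{genome_id}.genome : {description}",
        # Build command; -gtf22 tells snpEff the annotation is a GTF file
        "command": [
            "java", "-jar", str(Path(snpeff_dir) / "snpEff.jar"),
            "build", "-gtf22", "-v", genome_id,
        ],
    }


# Placeholder values, not names that ship with snpEff
plan = plan_snpeff_build("snpEff", "GRCh38_no_alt", "Human GRCh38 analysis set")
print(plan["config_line"])
print(" ".join(plan["command"]))
```

Using the FASTA that the reads were aligned to keeps the chromosome names and coordinates in the database identical to those in the VCF.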

2.5 years ago by vkkodali_ncbi ★ 3.7k

Hi @vkkodali, I used GRCh38.p13.RefSeq to annotate a VCF file containing 6422 SNPs and 1300 indels. I got:

"WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS": 682  
 "INFO_REALIGN_3_PRIME": 1339

I then built the database from the *.gtf available in the seqs_for_alignment_pipelines directory of the NCBI FTP site. I got the same number of warnings:

"WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS": 682  
 "INFO_REALIGN_3_PRIME": 60

With the UCSC version (http://hgdownload.cse.ucsc.edu), however, I got fewer warnings, only 92. I find that a little strange. I do not know whether using the UCSC version for the functional annotation will bias the prediction.

2.5 years ago by Adham • 0

It's hard to say what's going on without digging into the details and looking at specific examples. I don't have enough experience with snpEff to hazard a guess.
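One way to pull out specific examples is to list which transcripts carry each message in the annotated VCF. The sketch below assumes the standard snpEff `ANN` INFO field, where annotations are comma-separated, each has 16 pipe-separated subfields, the transcript ID is subfield 7 and the errors/warnings/info codes are in the last subfield joined by `&`. The helper name and the demo record are made up for illustration and are not real data.

```python
from collections import defaultdict


def transcripts_by_message(vcf_lines):
    """Map each snpEff message code to the set of transcript IDs that carry it."""
    flagged = defaultdict(set)
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # not a complete VCF record
        for entry in cols[7].split(";"):
            if not entry.startswith("ANN="):
                continue
            for ann in entry[4:].split(","):
                fields = ann.split("|")
                if len(fields) < 16:
                    continue  # unexpected ANN layout, ignore
                feature_id = fields[6]
                for code in fields[15].split("&"):
                    if code.strip():
                        flagged[code.strip()].add(feature_id)
    return flagged


# Made-up record, for illustration only
demo_ann = "|".join([
    "G", "missense_variant", "MODERATE", "GENE1", "GENE1", "transcript",
    "NM_000001.1", "protein_coding", "1/2", "c.1A>G", "p.Met1Val",
    "1/100", "1/90", "1/30", "",
    "WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS&INFO_REALIGN_3_PRIME",
])
demo_line = "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", "ANN=" + demo_ann])

flagged = transcripts_by_message([demo_line])
for code, transcripts in sorted(flagged.items()):
    print(code, len(transcripts), sorted(transcripts)[:5])
```

On a real file you would pass an open file handle instead of the demo list. If the warnings turn out to come from a small set of transcripts, those are the ones to compare across the RefSeq and UCSC databases.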

That said, I expect that you will see a different set of results depending on which version of the annotation you use, since the number of transcripts in each of these GTF files may be quite different. In the case of UCSC, you may want to confirm that they include both the known (accessions with NM_/NR_ prefix) and the model (accessions with XM_/XR_ prefix) RefSeq transcripts.
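A quick way to check this is to count the distinct transcript accessions per prefix in each GTF. The sketch below assumes a standard nine-column GTF whose attribute column contains `transcript_id "..."`; the function name and the demo lines are made up for illustration.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
REFSEQ_PREFIXES = {"NM", "NR", "XM", "XR"}


def count_refseq_prefixes(gtf_lines):
    """Count distinct transcript IDs per RefSeq accession prefix."""
    seen = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(cols[8])
        if match:  # empty transcript_id "" values do not match
            seen.add(match.group(1))
    counts = Counter()
    for transcript_id in seen:
        prefix = transcript_id.split("_", 1)[0]
        counts[prefix if prefix in REFSEQ_PREFIXES else "other"] += 1
    return counts


# Made-up GTF lines, for illustration only
demo_gtf = [
    'chr1\tRefSeq\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "NM_000001.1";',
    'chr1\tRefSeq\tCDS\t10\t90\t.\t+\t0\tgene_id "G1"; transcript_id "NM_000001.1";',
    'chr1\tRefSeq\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "XM_000002.1";',
    'chr1\tRefSeq\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "";',
]
prefix_counts = count_refseq_prefixes(demo_gtf)
print(dict(prefix_counts))
```

Running this on the GTF behind each database shows whether one of them lacks the XM_/XR_ models, which by itself could change how many transcripts are available to trigger a warning.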

2.5 years ago by vkkodali_ncbi ★ 3.7k