Get ClinVar info for a SNP with VEP
1
1
2.2 years ago
Eugene A ▴ 170

Hello everyone, I'm trying to annotate a VCF file with VEP and get the info from the ClinVar database.

The ClinVar database has a field corresponding to the ACMG classification (if I understand this correctly). For example, the variant at position 140749365 in BRAF is tagged "Pathogenic" in ClinVar ( https://www.ncbi.nlm.nih.gov/clinvar?term=((174176[AlleleID])OR(29020[AlleleID])) ). I assumed this information could be included in the VEP output. However, for some test cases I get an empty CLIN_SIG field, even with the "--everything" flag (and with --check_existing). I'm not sure what the problem is and I'm out of good guesses :(

The input line in the VCF file that produces an empty CLIN_SIG field:

chr7    140749365       .       G       A       50      PASS


On the other hand, an input line like this:

chr7    140753339       .       G       A       50      PASS


gives me a correctly filled CLIN_SIG field.

Other VEP output fields are also quite cryptic. For example, the PHENO field has the value "1&1,A", which is supposed to be linked to the "Existing_variation" field ("CM092083&COSV56058494"), but I'm not sure how to interpret it (it looks like a key -> value code, but I don't know where it is described).
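One way to read these parallel fields, as a minimal sketch: as far as I understand the VEP output documentation, PHENO flags whether each existing variant is associated with a phenotype, disease or trait, and its values follow the order of the IDs in Existing_variation, with "&" separating multiple values. The helper name below is made up for illustration, and the trailing ",A" in the value quoted above is not handled here.

```python
def pair_existing_with_flags(existing_variation, flags):
    """Pair each ID in Existing_variation with the flag at the same position.

    Both arguments are '&'-separated strings, as they appear in VEP output.
    """
    ids = existing_variation.split("&")
    values = flags.split("&")
    if len(ids) != len(values):
        raise ValueError("fields have different numbers of entries")
    return dict(zip(ids, values))


# Values taken from the question (without the trailing ",A").
pairs = pair_existing_with_flags("CM092083&COSV56058494", "1&1")
print(pairs)  # {'CM092083': '1', 'COSV56058494': '1'}
```

Under that reading, a flag of 1 would mean the corresponding known variant has an associated phenotype record.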

Hope for some ideas! Best wishes, Eugene

2
2.2 years ago
Collin ▴ 1000

A lot of variants will likely produce an empty ClinVar field because ClinVar pathogenicity assertions are only available for a small number of variants. However, you could cross-reference the annotations with another variant annotator just to be sure. For example, with OpenCRAVAT you can submit directly to the web server or run the command-line tool.

0

I do understand that most variants do not have a clinical significance. The problem is that both variants in my example do (according to the ClinVar website), but I'm getting the correct annotation (again, according to the ClinVar website) for only one of them. I initially thought it might be due to an outdated version of ClinVar in my VEP installation, but creating a custom annotation with the latest ClinVar release did not change anything :(

1

At least in my hands your first variant is generating a synonymous variant (BRAF D638D), which is not the same thing as the listed pathogenic variant.

1

If you want to check the actual 'raw' data, it is located here: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/
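If you download the VCF release from that FTP site, a quick check of what ClinVar itself records for a position could look like the sketch below. It assumes the clinical significance is stored under the CLNSIG key of the INFO column, which is how I remember the ClinVar VCF; the record in the example is made up for illustration and is not real ClinVar data. Also note that, as far as I know, the ClinVar VCF names chromosomes without the "chr" prefix.

```python
def parse_info(info):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def clinical_significance(vcf_line):
    """Return ((chrom, pos, ref, alt), CLNSIG or None) for one VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = cols[:5]
    info = parse_info(cols[7])
    return (chrom, int(pos), ref, alt), info.get("CLNSIG")


# Made-up record, for illustration only.
example = "7\t12345\t99999\tG\tA\t.\t.\tALLELEID=11111;CLNSIG=Pathogenic"
print(clinical_significance(example))  # (('7', 12345, 'G', 'A'), 'Pathogenic')
```

Comparing the REF and ALT columns of the matching record with your own input line shows quickly whether you are asking about the same allele.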

1

My bad: I was generating this toy example by hand from ClinVar and did not notice that BRAF is on the opposite strand, so the nucleotide change from ClinVar has to be reverse-complemented before putting it in the VCF :(
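For anyone hitting the same problem: VCF REF/ALT alleles are always given on the forward reference strand, so a change written relative to the transcript of a minus-strand gene has to be reverse-complemented first. A minimal sketch (the function names and the C>T example are hypothetical, not taken from ClinVar):

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T/N, either case)."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_forward_strand(ref, alt, strand):
    """Convert transcript-strand alleles to forward-strand VCF alleles."""
    if strand == "-":
        return reverse_complement(ref), reverse_complement(alt)
    if strand == "+":
        return ref, alt
    raise ValueError("strand must be '+' or '-'")


# A transcript-strand C>T change in a minus-strand gene becomes G>A in the VCF.
print(to_forward_strand("C", "T", "-"))  # ('G', 'A')
```

For single nucleotides this is just the complement, but reversing also keeps multi-base alleles correct.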

0

Hi, I'm investigating OpenCRAVAT, thank you for pointing me to it! I will probably add it to the pipeline I am building for visualization/filtering. I've also noticed something strange about the coding/non-coding visualizations.

My test VCF currently contains only non-coding SNPs according to VEP (and if I filter in CRAVAT on the "coding" field in the "Filter" tab, I get 0/52 SNPs).

BUT on the summary tab CRAVAT draws the following diagrams:

It seems to me that CRAVAT assigns "intergenic" SNPs to "coding"?

Best, Eugene

1

Hi, a lead architect of OpenCRAVAT here. It's a bug in the "Coding vs Noncoding Summary" widget. Sequence Ontology terms have been evolving and the widget did not keep up with the changes. We'll fix it and publish a fixed version shortly.

0

Hi, thanks for the clarification. I thought this was a bug, and I'm glad that the tool is improving and evolving; it's actually really cool software!

While I have the chance, I want to ask about performance: I'm running the tool on a cluster and I see that by default CRAVAT uses all cores at the mapping step, while all other steps seem to run on one core. Is it possible to speed things up, for example by loading everything into memory (I think disk speed is the limit there)?

Eugene

1

Thanks. In OpenCRAVAT (OC) 1.8.0, if multiple annotators are requested, they can run on multiple cores, but each annotator still runs on a single core. Other steps, such as the aggregator, also run on one core. OC started as a single-core program and we have been adding multicore support to more of its steps, so fully using multiple cores in every step is definitely the direction. By the way, the maximum number of cores to use can be set in OC's settings. See "number of concurrent annotations per job" in https://github.com/KarchinLab/open-cravat/wiki/5.-GUI-usage#system-setting.

Indeed, loading the annotation database into memory can improve annotation speed, since most annotators are I/O-bound. Some annotators' databases are small enough to load into memory, while others are too big. If you let me know which annotators you are using, we can examine them.
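To illustrate the general idea (this is not OC's actual code): assuming an annotator's data sit in an SQLite file, which is an assumption made here for the sake of the example, Python's standard sqlite3 module can copy the whole database into memory once so that later lookups avoid the disk. The toy table and file name below are invented for the demo.

```python
import os
import sqlite3
import tempfile


def load_into_memory(db_path):
    """Copy an on-disk SQLite database into a new in-memory connection."""
    disk = sqlite3.connect(db_path)
    memory = sqlite3.connect(":memory:")
    try:
        disk.backup(memory)  # Connection.backup is available from Python 3.7
    finally:
        disk.close()
    return memory


# Demo with a throwaway database; table and values are invented.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_annotator.sqlite")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE toy (chrom TEXT, pos INTEGER, label TEXT)")
    conn.execute("INSERT INTO toy VALUES ('chr1', 100, 'example')")
    conn.commit()
    conn.close()
    mem = load_into_memory(path)

# The file is gone, but the in-memory copy still answers queries.
rows = mem.execute(
    "SELECT label FROM toy WHERE chrom = 'chr1' AND pos = 100"
).fetchall()
print(rows)  # [('example',)]
```

This only pays off when the database fits comfortably in RAM; for the large ones, a faster disk is the more realistic option.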

0

My experience is similar in that disk speed can limit the speed of annotation. If you have a machine with an SSD you'll definitely get a speed boost. See the wiki for more detail: https://github.com/KarchinLab/open-cravat/wiki#system-capabilities .

1

Please run oc module install wgcodingvsnoncodingsummary to install the fixed version of the widget and see if the problem is gone.

0

Hi, I'd like to ask one more question about CRAVAT: I need to annotate my variants with exon number. In particular, I'm interested in whether a given SNP falls in the first or the last exon of the transcript (I'm currently adapting the InterVar code to work with OpenCRAVAT output to get an ACMG annotation).

I can get the list of exons for all transcripts from Ensembl BioMart (https://www.ensembl.org/biomart/martview/) and then prepare a separate annotator with the exon structure, but it turns out that CRAVAT uses somewhat outdated transcript versions (for example, ENST00000379389.4 is no longer in the database, and ENST00000379370 has version .6 in CRAVAT and .7 on the website).

This is most likely not a problem for my purpose in the majority of cases, but I'm wondering: 1) Is there a way to manually update CRAVAT to a newer Ensembl version? (As far as I can guess, some files in the common or mapper modules would have to be updated?) 2) Is there another way to get the exon structure from CRAVAT (maybe it is already there and I've missed it)?
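For the first/last exon part, the lookup itself is simple once exon coordinates are available. A rough sketch, assuming exons are given as 1-based inclusive genomic (start, end) pairs plus the transcript strand; the coordinates below are invented, and the function name is hypothetical. BioMart may also be able to export the exon rank directly, which would make the sorting step unnecessary.

```python
def exon_rank(exons, strand, pos):
    """Return (rank, total) for the exon containing pos, or None.

    exons  : list of (start, end), 1-based inclusive genomic coordinates
    strand : '+' or '-'; exon 1 is the most 5' exon of the transcript
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # On the minus strand the first exon has the highest coordinates.
    ordered = sorted(exons, reverse=(strand == "-"))
    for rank, (start, end) in enumerate(ordered, start=1):
        if start <= pos <= end:
            return rank, len(ordered)
    return None


# Invented three-exon transcript on the minus strand.
toy_exons = [(100, 200), (300, 400), (500, 600)]
print(exon_rank(toy_exons, "-", 550))  # (1, 3): first exon
print(exon_rank(toy_exons, "-", 150))  # (3, 3): last exon
print(exon_rank(toy_exons, "-", 250))  # None: intronic
```

A SNP is in the first exon when rank is 1 and in the last exon when rank equals total; for a single-exon transcript both are true.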

Best, Eugene

0

Cannot really delve into the details without having the data in front of me; however, there are many instances whereby a variant can be regarded as both intergenic and also 'coding'. Think of splice-isoforms, which can vary in length by megabases. Some isoforms even span multiple genes. The 'fluid' human genome is a microcosm of evolution in its own right - every eventuality exists.

0

I do understand how one SNP can be both intergenic and coding; the problem is the inconsistency within OpenCRAVAT (maybe I did not explain it clearly enough). Here is the filtering tab:

I really do not understand how this could be consistent with the previous diagram. The only explanation I see is that the word "coding" means different things in the "Summary" and "Filter" tabs.

0

Probably a question for CRAVAT