Question

ClinVar local install

0

8.0 years ago

win ▴ 970

Hi all. I want to do downstream analysis of VCF files and need to interface my data with ClinVar. I found ClinVar in VCF format here: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/common_all_20151104.vcf.gz

Planning to convert this VCF file into Hive format.

I wanted to know if anyone has suggestions for setting up a local instance of ClinVar that can be easily queried.

Thanks in advance.

ClinVar • 3.7k views

ADD COMMENT • link updated 8.0 years ago by DG 7.3k • written 8.0 years ago by win ▴ 970

0

Planning to convert this VCF file into Hive format.

Do you mean that you want to insert this into Apache Hive and need some Hive insert statements?

I wanted to know if anyone has suggestions for setting up a local instance of ClinVar that can be easily queried.

What do you want to query? How is that different from inserting it into Hive?

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 161k

0

Yes, insert this into Apache Hive.

I wanted to know whether Hive is a reasonable option, or whether there is a better way of doing this.

ADD REPLY • link 8.0 years ago by win ▴ 970

0

It depends on your needs and on what you want to query.

ADD REPLY • link 8.0 years ago by Pierre Lindenbaum 161k

0

I want to identify pathogenic variants.

ADD REPLY • link 8.0 years ago by win ▴ 970

score 1 · Answer 1 · 2016-04-12

I recommend Ensembl VEP. Along with many other annotations that are relevant for judging whether a variant is damaging, it reports CLIN_SIG, the ClinVar clinical significance. VEP's web interface works fine for small jobs. For larger jobs and more customizable usage you can install it locally, which is not hard if you go the Ensembl virtual machine route.
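If you run VEP with VCF output, the annotations end up packed into the CSQ INFO field, and the order of the sub-fields is declared in the header. Here is a minimal sketch of pulling CLIN_SIG back out with plain Python. The header and record below are made up for illustration and trimmed to four sub-fields; a real VEP header has many more, and I believe CLIN_SIG only appears when co-located variant lookup is enabled (for example with `--check_existing`), so check your own header first.

```python
def csq_fields(header_lines):
    """Return the CSQ sub-field names declared in the VCF header."""
    for line in header_lines:
        if line.startswith("##INFO=<ID=CSQ"):
            # VEP lists the sub-field order after "Format: " in the description.
            fmt = line.split("Format: ", 1)[1]
            return fmt.rstrip('">\n').split("|")
    raise ValueError("no CSQ header line found")


def clin_sig_values(info, fields):
    """Collect the CLIN_SIG terms from one record's INFO column."""
    idx = fields.index("CLIN_SIG")
    terms = set()
    for entry in info.split(";"):
        if not entry.startswith("CSQ="):
            continue
        # One comma-separated block per transcript/consequence.
        for block in entry[4:].split(","):
            value = block.split("|")[idx]
            # Multiple terms within one sub-field are joined with "&".
            terms.update(t for t in value.split("&") if t)
    return terms


# Made-up header and INFO column, shortened to four CSQ sub-fields.
header = [
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|SYMBOL|CLIN_SIG">'
]
fields = csq_fields(header)
info = "DP=50;CSQ=A|missense_variant|GENE_A|pathogenic&likely_pathogenic,A|intron_variant|GENE_A|"
found = clin_sig_values(info, fields)
print(sorted(found))
```

Reading the field order from the header rather than hard-coding an index matters because the CSQ layout changes with the VEP options you pick.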

score 1 · Answer 2 · 2016-04-12

1

8.0 years ago

DG 7.3k

Do you want to analyze the ClinVar data itself, or do you want to annotate your own data with ClinVar and analyze that? Your question doesn't make that clear, but I am assuming you want to do the latter, as just parsing through the ClinVar VCF for pathogenic variants doesn't require much.

If you want to annotate your own VCFs with ClinVar data, I would suggest a few options. As @Carlos Borroto suggests, you can use VEP. You could also use GEMINI, which adds ClinVar annotations to VCF files that are already annotated with VEP or snpEff and converts them into a database (sqlite3 by default, but moving toward support for other SQL-based database backends), which you can then query and explore.

If you just want to add ClinVar annotations to your VCF directly and then query the VCF itself however you like, there is also vcfanno, a fast annotator that pulls annotations from whichever VCF or BED files you list. It is fast, powerful, and flexible, and it accomplishes a lot of what GEMINI does without converting to a database at the end. I am now using a combination of snpEff and vcfanno to add custom annotations to my VCF files, and I am then storing my variants in a Cassandra database.
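For the first case (just wanting a local, queryable copy of ClinVar), a rough sketch with Python's built-in sqlite3 module might look like the following. This is not GEMINI's schema or method, only a hand-rolled table. It assumes the current ClinVar VCF layout, where CLNSIG holds text terms such as Pathogenic and GENEINFO holds the gene; older releases encoded significance differently. The records in the demo are invented placeholders.

```python
import gzip
import sqlite3


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def parse_info(info):
    """Turn 'A=1;B=2;FLAG' into a dict (bare flags map to True)."""
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def load_clinvar(lines, conn):
    """Insert one row per VCF record into a 'clinvar' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clinvar "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, clnsig TEXT, gene TEXT)"
    )
    rows = []
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        cols = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = cols[0], cols[1], cols[3], cols[4], cols[7]
        tags = parse_info(info)
        rows.append(
            (chrom, int(pos), ref, alt, tags.get("CLNSIG", ""), tags.get("GENEINFO", ""))
        )
    conn.executemany("INSERT INTO clinvar VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_site ON clinvar (chrom, pos, ref, alt)")
    conn.commit()


# Invented records for the demo; for a real file pass open_vcf("clinvar.vcf.gz").
demo = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t101\tG\tA\t.\t.\tCLNSIG=Pathogenic;GENEINFO=GENE_A:1\n",
    "1\t2000\t102\tC\tT\t.\t.\tCLNSIG=Benign;GENEINFO=GENE_B:2\n",
]
conn = sqlite3.connect(":memory:")
load_clinvar(demo, conn)
hits = conn.execute(
    "SELECT chrom, pos, gene FROM clinvar WHERE clnsig LIKE 'Pathogenic%'"
).fetchall()
print(hits)
```

The `LIKE 'Pathogenic%'` filter is deliberately crude: it catches "Pathogenic" and "Pathogenic/Likely_pathogenic" but not a bare "Likely_pathogenic", so adjust it to the terms you care about.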

ADD COMMENT • link 8.0 years ago by DG 7.3k

0

Thanks for the excellent feedback. As you correctly pointed out, what I am really after is finding pathogenic variants in my VCF. As suggested, I could annotate the VCF with ClinVar data and simply query the VCF, or interface the VCF with a locally installed ClinVar database and find the variants that way.

I have used VEP in the past and it works well.

Is there a standard pipeline to detect pathogenic variants?

ADD REPLY • link 8.0 years ago by win ▴ 970

0

Sort of. It depends on what sort of disease project you're working on. If you're just looking for known pathogenic variants from ClinVar, then annotating your VCF with an up-to-date ClinVar VCF and querying for anything flagged as pathogenic will mostly do the trick. The caveats are that not all known pathogenic variants will necessarily be in ClinVar (though most should be), and that some variants in ClinVar are incorrectly annotated as pathogenic.
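To make the "annotate and query" idea concrete, here is a bare-bones sketch that matches sample variants against ClinVar by exact (chrom, pos, ref, alt) and keeps the ones whose CLNSIG contains a pathogenic term. It assumes text-style CLNSIG values, biallelic records, and that both files are normalized the same way; real tools such as vcfanno handle multi-allelic sites and indexing properly. All records are invented.

```python
PATHOGENIC_TERMS = {"Pathogenic", "Likely_pathogenic"}


def is_pathogenic(clnsig):
    """True if any term in a CLNSIG string is (likely) pathogenic."""
    for sep in ("|", ","):
        clnsig = clnsig.replace(sep, "/")
    # Whole-term match, so "Conflicting_..._pathogenicity" is not a hit.
    return any(term in PATHOGENIC_TERMS for term in clnsig.split("/"))


def site_key(cols):
    """Build a lookup key, dropping a leading 'chr' so naming styles match."""
    chrom, pos, ref, alt = cols[0], cols[1], cols[3], cols[4]
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref, alt)


def clinvar_index(lines):
    """Map each ClinVar site to its CLNSIG string."""
    index = {}
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        info = dict(item.partition("=")[::2] for item in cols[7].split(";"))
        index[site_key(cols)] = info.get("CLNSIG", "")
    return index


def flag_pathogenic(sample_lines, index):
    """Return (site, CLNSIG) for sample variants ClinVar calls pathogenic."""
    hits = []
    for line in sample_lines:
        if line.startswith("#"):
            continue
        key = site_key(line.rstrip("\n").split("\t"))
        clnsig = index.get(key)
        if clnsig and is_pathogenic(clnsig):
            hits.append((key, clnsig))
    return hits


# Invented ClinVar and sample records.
clinvar = [
    "1\t1000\t101\tG\tA\t.\t.\tCLNSIG=Pathogenic/Likely_pathogenic",
    "1\t2000\t102\tC\tT\t.\t.\tCLNSIG=Conflicting_interpretations_of_pathogenicity",
]
sample = [
    "chr1\t1000\t.\tG\tA\t50\tPASS\tDP=30",
    "chr1\t2000\t.\tC\tT\t50\tPASS\tDP=28",
    "chr1\t3000\t.\tT\tC\t50\tPASS\tDP=31",
]
hits = flag_pathogenic(sample, clinvar_index(clinvar))
print(hits)
```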

If you are also looking for new, potentially causal variants in a disease cohort, it gets more complicated. Basically, you want to follow something like the GATK Best Practices analysis pipeline, annotate with VEP, snpEff, GEMINI, Annovar, or whichever annotator or annotators you prefer, and then start filtering your variants appropriately. Pay particular attention to whether the variant segregates appropriately among affected and unaffected individuals in your family, and to the population-based minor allele frequency. Common cutoffs are 1%, 0.05%, or even 0.01%, depending on how conservative you want to be and on the frequency of the disease.
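As a toy illustration of those two filters, the sketch below keeps variants that are rarer than a cutoff and that segregate with disease in a family. The segregation rule assumes a simple dominant model (every affected person carries the alt allele, no unaffected person does), which is my assumption for the example and will not fit recessive or de novo designs. The variant IDs, frequencies, and genotypes are invented.

```python
def passes_frequency(maf, cutoff=0.01):
    """Keep variants absent from the population data or rarer than the cutoff."""
    return maf is None or maf < cutoff


def segregates_dominant(genotypes, affected):
    """True if carriers of the alt allele are exactly the affected samples."""
    for sample, gt in genotypes.items():
        # Treat phased (0|1) and unphased (0/1) genotypes alike.
        carrier = "1" in gt.replace("|", "/").split("/")
        if carrier != (sample in affected):
            return False
    return True


# Invented family: father and child affected, mother unaffected.
affected = {"father", "child"}
variants = [
    {"id": "var_a", "maf": 0.0002,
     "gt": {"father": "0/1", "mother": "0/0", "child": "0/1"}},
    {"id": "var_b", "maf": 0.15,
     "gt": {"father": "0/1", "mother": "0/0", "child": "0/1"}},
    {"id": "var_c", "maf": None,
     "gt": {"father": "0/0", "mother": "0/1", "child": "0|1"}},
]

candidates = [
    v["id"]
    for v in variants
    if passes_frequency(v["maf"], cutoff=0.01)
    and segregates_dominant(v["gt"], affected)
]
print(candidates)
```

Here var_b is dropped for being too common and var_c for not segregating, leaving var_a. Note that a missing genotype ("./.") counts as a non-carrier in this sketch, which you would probably want to handle more carefully on real data.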