Programmatically retrieving ClinVar records with the same level of detail as shown in the web interface
5
1
6.2 years ago
ersenkavak ▴ 10

Hello,

The problem I am having is reaching the Clinical Assertion records under the ClinVar variation records. Specifically, I am trying to retrieve the submission(s) and associated PubMed IDs for some variants (which some people like to call alleles).

If we take the following variant page as an example

http://www.ncbi.nlm.nih.gov/clinvar/variation/48074/

Suppose we define a small variant as genomeversion_chromosome_position_refbase_altbase.

ClinVar provides several FTP dumps, but for a couple of reasons we prefer not to use them and would rather fetch JSON output from eSearch, eSummary or eFetch.

For the above variant definition, one can use the eSearch API with the chromosomal coordinates to retrieve the variation ID as follows (this is a real variation record, so the links work):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=clinvar&term=1[chr]+AND+156104629:156104629[chrpos37]&retmode=json
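For anyone scripting this step, here is a minimal Python sketch of the same eSearch call. It only builds the URL shown above and reads the ID list out of the JSON reply; the function names (`esearch_url`, `variation_ids`, `fetch`) are made up for illustration, and the `esearchresult`/`idlist` layout is what eSearch returned in JSON mode as far as I know, so check it against a live response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(chrom, pos):
    """Build the eSearch URL for a single GRCh37 position (hypothetical helper)."""
    term = f"{chrom}[chr] AND {pos}:{pos}[chrpos37]"
    # urlencode escapes the brackets, colon and spaces in the search term.
    return ESEARCH + "?" + urlencode(
        {"db": "clinvar", "term": term, "retmode": "json"}
    )


def variation_ids(esearch_json):
    """Return the ID list from an eSearch JSON body.

    Assumes the usual {"esearchresult": {"idlist": [...]}} layout.
    """
    return json.loads(esearch_json)["esearchresult"]["idlist"]


def fetch(url):
    """Download a URL and return the body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

Usage would be along the lines of `variation_ids(fetch(esearch_url(1, 156104629)))`. Nothing is downloaded until `fetch` is called.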

The variation ID returned in that JSON is then used to retrieve the variation record from ClinVar as follows:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=clinvar&id=48074&retmode=json
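As a rough sketch of what can be pulled out of that eSummary reply in Python: the function below reads the SCV accessions from the `supporting_submissions` key. The nesting (`result` -> variation ID -> `supporting_submissions` -> `scv`) is an assumption based on the responses I have seen, and the sample document uses placeholder accessions, not real records.

```python
import json


def supporting_scvs(esummary_json, variation_id):
    """Return the SCV accessions listed for one variation ID.

    Assumed layout (verify against a live response):
    {"result": {"<id>": {"supporting_submissions": {"scv": [...]}}}}
    """
    record = json.loads(esummary_json)["result"][str(variation_id)]
    submissions = record.get("supporting_submissions", {})
    return list(submissions.get("scv", []))


# Placeholder document shaped like the assumed layout; accessions are made up.
SAMPLE_SUMMARY = json.dumps({
    "result": {
        "uids": ["48074"],
        "48074": {
            "uid": "48074",
            "supporting_submissions": {
                "scv": ["SCV000000001", "SCV000000002"],
                "rcv": ["RCV000000001"],
            },
        },
    }
})

print(supporting_scvs(SAMPLE_SUMMARY, 48074))
```

This only gets you the accession strings, which is exactly the limitation described in this question.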

The problem is that the latter only gives me a supporting_submissions dictionary key, which contains IDs like SCV00000567 but nothing else. This is where I am stuck: how can I go on to fetch the details of these submissions, such as pathogenicity, variation, or the supporting PubMed IDs?

ncbi API • 5.0k views
0

Just wanted to point out that the FTP VCFs and the variants on the ClinVar web pages are not in sync, and the FTP files contain far fewer variants. If you do not want to use the NCBI downloads, you can also download from UCSC, which I think is much more comprehensive, better curated, in sync with the web version, and gives all the info you need.

0

EDIT: I added my reply at the bottom, where it belongs, not as a side comment.

4
6.2 years ago
dandan ▴ 370

I work at SolveBio. We provide reference data for bioinformaticians and genetic/genomic diagnostics, and this sort of programmatic access to data is EXACTLY what we do!

I've thrown up a quick ipython notebook script to show you how to do this exact query on SolveBio - http://nbviewer.ipython.org/gist/dandanxu/bb11cf23a513f3879b8a

You can sign up for an account with SolveBio (it's totally free for low-volume users) and try it out.

Let me know if this helps. Would love to get some feedback too if you try it out.


1
6.2 years ago

There is no easy way.

I would do this using Java and a local copy of the ClinVar XML:

Generate a parser from the XML schema: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/xsd_public/ (see How To Get A Dbsnp Data From Java? )

Read each entry and put each record in a relational database or a key/value database (BerkeleyDB).

Create a tool which retrieves an XML record for a given ID and converts the output to HTML using XSLT.
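The steps above can be sketched roughly as follows. Note that this is Python with SQLite rather than Java with BerkeleyDB, and it streams the XML generically instead of using a schema-generated parser. The record element name (`ClinVarSet`) and its `ID` attribute are assumptions about the full-release XML, so adjust them to the file you actually download. The XSLT step is left out because the standard library has no XSLT engine (lxml provides one).

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def index_records(xml_source, conn, record_tag="ClinVarSet"):
    """Stream a large XML file and store each record's XML keyed by its ID.

    record_tag and the "ID" attribute are assumptions about the dump format.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS record (id TEXT PRIMARY KEY, xml TEXT)"
    )
    count = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != record_tag:
            continue
        conn.execute(
            "INSERT OR REPLACE INTO record VALUES (?, ?)",
            (elem.get("ID"), ET.tostring(elem, encoding="unicode")),
        )
        count += 1
        elem.clear()  # drop the parsed subtree to keep memory use low
    conn.commit()
    return count


def get_record(conn, record_id):
    """Return the stored XML for one record ID, or None if it is absent."""
    row = conn.execute(
        "SELECT xml FROM record WHERE id = ?", (record_id,)
    ).fetchone()
    return row[0] if row else None


# Tiny made-up document standing in for the real dump.
SAMPLE_DUMP = (
    b'<ReleaseSet>'
    b'<ClinVarSet ID="1"><Title>example A</Title></ClinVarSet>'
    b'<ClinVarSet ID="2"><Title>example B</Title></ClinVarSet>'
    b'</ReleaseSet>'
)

conn = sqlite3.connect(":memory:")
indexed = index_records(io.BytesIO(SAMPLE_DUMP), conn)
print(indexed, get_record(conn, "2"))
```

For the real file you would pass a path (or an opened gzip stream) instead of the in-memory sample, and use an on-disk SQLite database.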

1
3.6 years ago

I'm replying 2 years after this was posted, but it might prove helpful to other people with the same issue. I found this after playing around for a while with NCBI's API, which can be less helpful than one might expect.

The quick answer is that, in the last URL, you need to use efetch instead of esummary and replace retmode=json with rettype=variation:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=clinvar&id=48074&rettype=variation

There you will find everything you see in the web interface. Sadly, it's ugly af, unparsed XML, and you have to extract the info from it yourself. Biopython does not help with parsing this XML AFAIK, so you have to get your hands dirty with an XML parser in your language of choice. If you are a Python programmer, I recommend Beautiful Soup. :/
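To give an idea of the extraction step, here is a minimal sketch that uses the standard library's ElementTree instead of Beautiful Soup (just to avoid an extra dependency). The element and attribute names (`ClinicalAssertion`, `ClinVarAccession` with an `Accession` attribute, `ID` with `Source="PubMed"`) are assumptions about the XML layout, and the sample document is made up, so compare them with what efetch actually returns before relying on this.

```python
import xml.etree.ElementTree as ET


def submissions_with_pubmed(xml_text):
    """Map each SCV accession to the PubMed IDs cited in its submission.

    Assumes each submission is a <ClinicalAssertion> with a direct
    <ClinVarAccession Accession="SCV..."> child and citations given as
    <ID Source="PubMed">...</ID>; these names are not guaranteed.
    """
    root = ET.fromstring(xml_text)
    result = {}
    for assertion in root.iter("ClinicalAssertion"):
        accession_elem = assertion.find("ClinVarAccession")
        if accession_elem is None:
            continue
        pmids = []
        for id_elem in assertion.iter("ID"):
            if id_elem.get("Source") == "PubMed" and id_elem.text:
                pmid = id_elem.text.strip()
                if pmid not in pmids:  # keep order, skip duplicates
                    pmids.append(pmid)
        result[accession_elem.get("Accession")] = pmids
    return result


# Made-up XML in the assumed layout; accessions and PMIDs are placeholders.
SAMPLE_XML = """<Root>
  <ClinicalAssertion>
    <ClinVarAccession Accession="SCV000000001"/>
    <Citation><ID Source="PubMed">11111111</ID></Citation>
    <Citation><ID Source="PubMed">22222222</ID></Citation>
  </ClinicalAssertion>
  <ClinicalAssertion>
    <ClinVarAccession Accession="SCV000000002"/>
  </ClinicalAssertion>
</Root>"""

print(submissions_with_pubmed(SAMPLE_XML))
```

The same loop can be extended to pull the clinical significance or submitter from each submission once you know which child elements carry them.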

0
6.2 years ago

You can use ANNOVAR to get decent ClinVar annotations. They aren't as comprehensive as the website, but they include identifiers (such as RCV000064059.2) so that you can look up more information as needed.

0
6 months ago
ariel ▴ 140

In case someone stumbles across this, here is a very up-to-date resource:

https://github.com/macarthur-lab/clinvar

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5473414/