Question

dbNSFP VEP plugin and tabix query give different results

4.4 years ago

heikobrennenstuhl • 0

Dear community,

I use the dbNSFP file locally and have downloaded it as described on the official website, sorted it and indexed it using Tabix. Now when I use the VEP command line tool offline with the dbNSFP plugin, I receive all the biologically possible variants and associated pathogenicity scores calculated cleanly for my gene.

However, when I retrieve the scores directly from the dbNSFP file using a tabix query without using the VEP script & dbNSFP plugin, certain chromosomal sections are missing. Do you have any suggestion what could be the reason for this?

I am using R for my direct query with the seqminer package and the following command:

tabix.read.table(file, position, col.names = TRUE, stringsAsFactors = FALSE)

"file" is the path to my dbNSFP file and "position" gives the chromosome and the position range on the chromosome.

Maybe there is a simple solution to my problem?

Thanks for your help!

dbNSFP tabix VEP • 1.9k views

updated 4.4 years ago by Emily 24k • written 4.4 years ago by heikobrennenstuhl • 0

Can you please give us more details about this? Which regions or variants have missing data? If you'd prefer not to share your data in a public forum, please email us at helpdesk@ensembl.org.

4.4 years ago by Emily 24k

With pleasure! When I, for example, query the region 17:47941571-47949308 for the PNPO gene (ENST00000642017.2) via the VEP script using the dbNSFP plugin on the command line, I retrieve the full set of scores for each mutated amino acid position. A similar query using tabix returns a result in which amino acids 166-245 are missing. Since the dbNSFP file is sorted, I don't understand what is causing the failure.

Is it possible that within the file under the region I queried, lines appear again that contain information for the same region? In the meantime I have also re-sorted and indexed individual chromosome files of the dbNSFP data - unfortunately with the same result.
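One way to test the suspicion that lines for the same region reappear later in the file is to scan the uncompressed rows and report every place where the coordinate order breaks. The following is only a minimal sketch: the function name is hypothetical, and it assumes that the chromosome is in the first tab-separated column, the 1-based position is in the second, and header lines start with "#".

```python
from typing import Iterable, List, Tuple


def find_order_violations(
    lines: Iterable[str], chrom_col: int = 0, pos_col: int = 1
) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs where coordinate order breaks.

    Assumed layout (not taken from the thread): tab-separated rows with the
    chromosome in column `chrom_col` and an integer position in `pos_col`.
    """
    violations: List[Tuple[int, str]] = []
    finished = set()  # chromosomes whose block has already ended
    prev_chrom = None
    prev_pos = -1
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[chrom_col], int(fields[pos_col])
        if chrom != prev_chrom:
            # A chromosome that starts a second block means its rows
            # are not contiguous in the file.
            if chrom in finished:
                violations.append((lineno, f"chromosome {chrom} reappears"))
            if prev_chrom is not None:
                finished.add(prev_chrom)
            prev_chrom, prev_pos = chrom, pos
            continue
        if pos < prev_pos:
            violations.append((lineno, f"position {pos} follows {prev_pos}"))
        prev_pos = pos
    return violations


# Made-up rows for illustration only.
example = ["#chr\tpos", "17\t100", "17\t105", "17\t101", "18\t5", "17\t200"]
problems = find_order_violations(example)
```

An empty result means the rows are grouped by chromosome and ordered by position, so a missing stretch would have to come from somewhere else, such as how the query result is parsed.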

--

I discovered that different transcripts are separated by ";" in the Ensembl_transcriptid column - could this be responsible for the problem?

4.4 years ago by heikobrennenstuhl • 0

score 0 · Answer 1 · 2021-01-28

4.4 years ago

Emily 24k

We're not familiar with the R package you're using with tabix. Could you try running tabix directly, e.g.:

tabix dbNSFP4.0a_grch38.gz 17:47941571-47949308

4.4 years ago by Emily 24k

Dear Emily,

tabix runs smoothly and gives me the same dataframe...

Unfortunately, the transcripts and their corresponding scores are packed into single columns, separated by semicolons. I had to write a function that reads the file line by line and splits each line into multiple new lines, one per transcript with its corresponding scores. This gives me my desired result, but unfortunately the calculation time is quite long.
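For readers facing the same layout, the splitting step might look roughly like the sketch below. It is not the function described above, only an illustration in Python under stated assumptions: the function names are hypothetical, the transcript column is taken to be Ensembl_transcriptid, and a field is treated as per-transcript only when it splits into exactly as many parts as that column. Any other field is copied unchanged onto every new row.

```python
from typing import Dict, Iterable, Iterator


def expand_transcripts(
    row: Dict[str, str], key: str = "Ensembl_transcriptid", sep: str = ";"
) -> Iterator[Dict[str, str]]:
    """Yield one row per transcript from a row with packed fields."""
    n = len(row[key].split(sep))
    per_transcript = {}
    for name, value in row.items():
        parts = value.split(sep)
        # Assumption: only fields with one part per transcript are split;
        # everything else is repeated as-is on each output row.
        per_transcript[name] = parts if len(parts) == n else [value] * n
    for i in range(n):
        yield {name: parts[i] for name, parts in per_transcript.items()}


def expand_lines(header: str, lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Stream tab-separated lines (e.g. tabix output) into expanded rows."""
    names = header.rstrip("\n").split("\t")
    for line in lines:
        values = line.rstrip("\n").split("\t")
        yield from expand_transcripts(dict(zip(names, values)))


# Placeholder identifiers and made-up scores, for illustration only.
header = "#chr\tpos(1-based)\tEnsembl_transcriptid\tSIFT_score\tCADD_phred"
packed = ["17\t47941600\tENST_A;ENST_B\t0.01;0.2\t23.1"]
rows = list(expand_lines(header, packed))
```

Because both functions are generators, rows are produced one at a time rather than by repeatedly growing a table, which may help with run time on large regions. One caveat: an unrelated multi-valued field that happens to have the same number of parts as the transcript column would also be split, so the columns to expand may need to be listed explicitly.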

4.4 years ago by heikobrennenstuhl • 0

If you get the data with tabix directly but it's missing with the R package, that suggests something strange with your R package. I suggest you speak to the people who made it.

4.4 years ago by Emily 24k