Question: dbNSFP VEP plugin and tabix query give different results
5 weeks ago, heikobrennenstuhl wrote:

Dear community,

I use a local copy of the dbNSFP file: I downloaded it as described on the official website, then sorted it and indexed it with tabix. When I run the VEP command-line tool offline with the dbNSFP plugin, I get all the biologically possible variants and their associated pathogenicity scores for my gene.

However, when I retrieve the scores directly from the dbNSFP file with a tabix query, without the VEP script and the dbNSFP plugin, certain chromosomal sections are missing. Do you have any suggestions as to what could be causing this?

I am using R for my direct query with the seqminer package and the following command:

tabix.read.table(file, position, col.names = TRUE, stringsAsFactors = FALSE)

"file" is the path to my dbNSFP file, and "position" gives the chromosome and the position on the chromosome.

Maybe there is a simple solution to my problem?

Thanks for your help!

dbnsfp vep tabix • 161 views
modified 4 weeks ago by Emily_Ensembl • written 5 weeks ago by heikobrennenstuhl

Can you please give us more details about this? Which regions or variants have missing data? If you'd prefer not to share your data in a public forum, please email us at helpdesk@ensembl.org.

written 5 weeks ago by Emily_Ensembl

With pleasure! For example, when I query the region 17:47941571-47949308 for the PNPO gene (ENST00000642017.2) via the VEP script on the command line with the dbNSFP plugin, I get the full set of scores for each mutated amino acid position. A similar query using tabix returns a result in which the region covering amino acids 166-245 is missing. Since the dbNSFP file is sorted, I don't understand what is causing the failure.

Is it possible that, further down in the file below the region I queried, lines appear again that contain information for the same region? In the meantime I have also re-sorted and re-indexed individual chromosome files of the dbNSFP data, unfortunately with the same result.

--

I discovered that different transcripts are separated by ";" in the Ensembl_transcriptid column. Could this be responsible for the problem?

modified 4 weeks ago • written 4 weeks ago by heikobrennenstuhl
4 weeks ago, Emily_Ensembl (EMBL-EBI) wrote:

We're not familiar with the R package you're using with tabix. Could you try running tabix directly, e.g.:

tabix dbNSFP4.0a_grch38.gz 17:47941571-47949308
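
To put numbers on a direct tabix query, it may help to count the rows and distinct positions it returns and compare them with the row count of the data frame in R. The following is a minimal sketch in Python (not the R used in the question) and is not part of the original exchange. `run_tabix` and `summarize_rows` are hypothetical helper names. It assumes the `tabix` executable is on the PATH and that the position sits in the second tab-separated column of the dbNSFP file.

```python
import subprocess
from collections import Counter


def run_tabix(path, region):
    """Run `tabix <path> <region>` and return its stdout as text.

    Assumes the tabix executable is installed and on the PATH.
    """
    result = subprocess.run(
        ["tabix", path, region],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def summarize_rows(text, pos_col=1):
    """Count data rows and distinct positions in tab-separated tabix output.

    pos_col is the 0-based index of the position column (assumed here to
    be the second column).
    """
    per_position = Counter()
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/meta lines
        fields = line.split("\t")
        per_position[int(fields[pos_col])] += 1
    return {
        "rows": sum(per_position.values()),
        "positions": len(per_position),
        "first": min(per_position) if per_position else None,
        "last": max(per_position) if per_position else None,
    }
```

For instance, `summarize_rows(run_tabix("dbNSFP4.0a_grch38.gz", "17:47941571-47949308"))` would give a row count and a first and last position to set against `nrow()` and the position range of the result from the R package.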

written 4 weeks ago by Emily_Ensembl

Dear Emily,

tabix runs smoothly and gives me the same dataframe...

Unfortunately, each line holds several transcripts, and the transcript IDs and their corresponding scores are stored within single columns as semicolon-separated values. I had to write a function that reads the file line by line and splits the transcripts and their corresponding scores onto multiple new lines. This gives me my desired result, but unfortunately the computation time is quite long.
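
The reshaping described here, one output row per transcript, can be sketched as follows. This is a minimal illustration in Python rather than R and is not the function actually used above. `expand_transcripts` is a hypothetical helper name, and the caller has to say which columns hold one value per transcript. A listed column is only split when its number of values matches the number of transcript IDs; otherwise it is copied unchanged.

```python
def expand_transcripts(row, key="Ensembl_transcriptid",
                       per_transcript_cols=(), sep=";"):
    """Yield one dict per transcript from a single dbNSFP-style row.

    row: dict mapping column name -> raw string value.
    key: column holding the separated transcript IDs.
    per_transcript_cols: columns assumed to hold one value per transcript.
    """
    transcripts = row[key].split(sep)
    n = len(transcripts)

    # Split each candidate column once, and only keep the split when the
    # number of values lines up with the number of transcripts.
    split_cols = {}
    for col in per_transcript_cols:
        parts = row[col].split(sep)
        if len(parts) == n:
            split_cols[col] = parts

    for i, transcript in enumerate(transcripts):
        out = dict(row)  # copy, so shared columns are carried over as-is
        out[key] = transcript
        for col, parts in split_cols.items():
            out[col] = parts[i]
        yield out
```

Called on a row such as `{"Ensembl_transcriptid": "ENST_A;ENST_B", "SIFT_score": "0.1;0.2"}` with `per_transcript_cols=["SIFT_score"]`, it yields two rows, one per transcript; the IDs and values in this call are made up for illustration. Because it is a generator, rows can be processed as they are read instead of being collected in memory first.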

written 4 weeks ago by heikobrennenstuhl

If you get the data with tabix directly but it's missing with the R package, that suggests something strange with your R package. I suggest you speak to the people who made it.

written 4 weeks ago by Emily_Ensembl