Dear community,
I use the dbNSFP file locally: I downloaded it as described on the official website, sorted it, and indexed it with Tabix. When I run the VEP command line tool offline with the dbNSFP plugin, I get all biologically possible variants for my gene together with their associated pathogenicity scores, as expected.
However, when I retrieve the scores directly from the dbNSFP file with a tabix query, without the VEP script and dbNSFP plugin, certain chromosomal sections are missing. Do you have any suggestions as to what could be causing this?
I am using R for my direct query with the seqminer package and the following command:
tabix.read.table(file, position, col.names = TRUE, stringsAsFactors = FALSE)
Here "file" is the path to my dbNSFP file and "position" gives the chromosome and the position range on that chromosome.
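For illustration, here is a minimal sketch of how such a query can be set up. The file path is a placeholder and make_region is a hypothetical helper written for this example, not part of seqminer; the region shown is just an example range in chrom:start-end form without a "chr" prefix.

```r
# Build a tabix range string of the form "chrom:start-end".
# make_region is a hypothetical helper for this example only.
make_region <- function(chrom, start, end) {
  sprintf("%s:%d-%d", chrom, as.integer(start), as.integer(end))
}

position <- make_region("17", 47941571, 47949308)

# Placeholder path (assumption): point this at your own bgzipped, tabix-indexed file.
file <- "dbNSFP_grch38.gz"

# Only run the query if seqminer is installed and the file actually exists.
if (requireNamespace("seqminer", quietly = TRUE) && file.exists(file)) {
  scores <- seqminer::tabix.read.table(file, position,
                                       col.names = TRUE,
                                       stringsAsFactors = FALSE)
  # Number of rows returned for the range; a first hint whether lines are missing.
  print(nrow(scores))
}
```

Comparing the row count against the number of lines VEP reports for the same range is a quick way to see whether the gap comes from the query itself or from later filtering.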
Maybe there is a simple solution to my problem?
Thanks for your help!
Can you please give us more details about this? Which regions or variants have missing data? If you'd prefer not to share your data in a public forum, please email us at helpdesk@ensembl.org.
With pleasure! For example, when I query the region 17:47941571-47949308 for the PNPO gene (ENST00000642017.2) with the VEP command line script and the dbNSFP plugin, I retrieve the full set of scores for each mutated amino acid position. The equivalent tabix query returns a result in which amino acids 166-245 are missing. Since the dbNSFP file is sorted, I don't understand what is causing the failure.
Is it possible that lines containing information for the region I queried appear a second time elsewhere in the file? In the meantime I have also re-sorted and re-indexed individual per-chromosome files of the dbNSFP data, unfortunately with the same result.
--
I discovered that different transcripts are separated by ";" in the Ensembl_transcriptid column. Could this be responsible for the problem?
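One way to check this is sketched below. It works on a made-up toy table rather than real dbNSFP rows, and it assumes that the aapos column is also ";"-separated in the same order as Ensembl_transcriptid and that transcript IDs are stored without a version suffix; both assumptions should be verified against the dbNSFP README. aapos_for_transcript is a hypothetical helper name.

```r
# Toy rows imitating two dbNSFP columns; all values are invented for illustration.
rows <- data.frame(
  Ensembl_transcriptid = c("ENST00000642017;ENST00000000001",
                           "ENST00000000001",
                           "ENST00000000002;ENST00000642017"),
  aapos = c("166;120", "121", "90;167"),
  stringsAsFactors = FALSE
)

# For each row, return the amino acid position that belongs to `transcript`,
# or NA if the transcript is not listed on that row.
aapos_for_transcript <- function(df, transcript) {
  ids <- strsplit(df$Ensembl_transcriptid, ";", fixed = TRUE)
  pos <- strsplit(df$aapos, ";", fixed = TRUE)
  mapply(function(id, p) {
    i <- match(transcript, id)  # index of the transcript within this row
    if (is.na(i)) NA_integer_ else as.integer(p[i])
  }, ids, pos, USE.NAMES = FALSE)
}

# Values: 166, NA, 167
result <- aapos_for_transcript(rows, "ENST00000642017")
print(result)
```

If a filter compares the whole field to a single transcript ID, or reads aapos as one number, rows like the first and third above would be dropped, which could look like a missing stretch of amino acids.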