Question: Retrieve Chromosome Number And Position From Gene Id In Danio Rerio
0
8.5 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Hi,

I'm working on a project in which I want to know where proteins, for which I have nucleotide sequences in one fish species, are located (chromosome and position) on the Danio rerio (zebrafish) genome. I BLAST my sequences against the Danio rerio transcriptome, extracted from the 'nr' database, and I then get geneIDs in the following format:

 gi|47087391|ref|NP_998590.1|
 gi|56090491|ref|NP_001007792.1|
 gi|169154248|emb|CAQ15172.1|
 gi|189523697|ref|XP_001341635.2|
 gi|189526610|ref|XP_687146.3|

From these, I would like to know the chromosome number and position on the chromosome of these genes on the Danio rerio genome (Zv9). Given that I have close to a thousand of these IDs, I want this process to be automated.

I can browse the zebrafish genome on different genome browsers, but how can I automate my search?

Many thanks

genome search • 5.9k views

If it's the nr database, then those are not "gene IDs". The gi is a unique identifier for the protein database; the second part is a protein accession.

ADD REPLYlink written 8.5 years ago by Neilfws48k

Ok, noted. Given your answer, I looked for another option and found what I needed. I'll post it as an answer.

ADD REPLYlink written 8.5 years ago by Eric Normandeau10k
2
8.5 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

This is actually quite a tricky problem, for several reasons.

  1. Your first identifier is a protein GI - a unique identifier used by NCBI Entrez. There's no simple way to go from a GI to chromosomal location using NCBI data or services.
  2. The second identifier is a protein accession, but these link to several different databases. For example, NP_ is RefSeq, XP_ is RefSeq predicted, and CAQ15172.1 is EMBL. This makes it difficult to query services such as BioMart unless you run a separate query for each type of accession.
  3. Your main problem though, is that you are using proteins to get to nucleotide data.
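
To make point 2 concrete, here is a minimal sketch that splits each identifier into its GI, database tag and accession, then groups the accessions by type so that each group could be submitted as a separate query. The function names are made up for illustration, and the lookup only covers the accession types mentioned above.

```python
from collections import defaultdict


def parse_nr_id(identifier):
    """Split a 'gi|<number>|<db>|<accession>|' identifier into its fields."""
    fields = identifier.strip().strip("|").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier format: {identifier!r}")
    return {"gi": fields[1], "db": fields[2], "accession": fields[3]}


def accession_type(record):
    """Classify a parsed record using only the types discussed in this thread."""
    accession = record["accession"]
    if accession.startswith("NP_"):
        return "RefSeq"
    if accession.startswith("XP_"):
        return "RefSeq predicted"
    if record["db"] == "emb":
        return "EMBL"
    return "other"


def group_by_type(identifiers):
    """Group accessions by type so each group can be queried separately."""
    groups = defaultdict(list)
    for identifier in identifiers:
        record = parse_nr_id(identifier)
        groups[accession_type(record)].append(record["accession"])
    return dict(groups)


ids = [
    "gi|47087391|ref|NP_998590.1|",
    "gi|56090491|ref|NP_001007792.1|",
    "gi|169154248|emb|CAQ15172.1|",
    "gi|189523697|ref|XP_001341635.2|",
    "gi|189526610|ref|XP_687146.3|",
]

groups = group_by_type(ids)
for kind, accessions in groups.items():
    print(kind, accessions)
```

Grouping only organises the identifiers; it does not by itself give a chromosomal location.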

If I were using BLAST for this purpose I would:

  1. Download the nucleotide sequences of D. rerio chromosomes
  2. Format them as a BLAST database
  3. BLAST search using tblastn (if my queries were protein sequences) or blastn (if my queries were transcript sequences)

And then my BLAST report would contain chromosome and location.
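
As a rough sketch of that last step, assuming the search was run with BLAST+ tabular output (`-outfmt 6`) and that the subject sequences in the database are named after their chromosomes, the best hit per query could be pulled out like this. The function name and the example lines are invented for illustration; they are not real search results.

```python
def best_hits(tabular_lines):
    """Return the highest-bitscore hit per query from BLAST tabular output.

    Assumes the 12 default columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        sstart, send = int(cols[8]), int(cols[9])
        bitscore = float(cols[11])
        hit = {
            "chromosome": subject,
            "start": min(sstart, send),
            "end": max(sstart, send),
            # Subject coordinates run backwards for minus-strand hits.
            "strand": "+" if sstart <= send else "-",
            "bitscore": bitscore,
        }
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = hit
    return best


# Made-up example lines, not real output.
example = [
    "query1\tchr9\t98.5\t200\t3\t0\t1\t200\t150600\t150001\t1e-100\t380",
    "query1\tchr3\t80.0\t100\t20\t0\t1\t100\t5000\t5299\t1e-20\t95.5",
]

hits = best_hits(example)
for query, hit in hits.items():
    print(query, hit["chromosome"], hit["start"], hit["end"], hit["strand"])
```

Keep in mind that a gene with several exons usually produces several separate alignments against a chromosome, so the single best hit locates only part of the gene.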

1
8.5 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

Given the insights and recommendations from @neilfws, I used Ensembl to retrieve all the protein sequences from Danio rerio in FASTA format. This solution works better for me than blasting against the nucleotide sequences, which I had already done, since I really only want to blast against coding regions. Moreover, the sequence names now contain the information I need, i.e. the chromosome to which they belong, as opposed to what I was getting from my Danio subset of the 'nr' database.

I can now blast my sequences using blastx and know on what chromosome they hit.
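
For anyone wanting to automate this, here is a minimal sketch of pulling the location out of such a sequence name. It assumes the header carries a location field of the form `chromosome:<assembly>:<name>:<start>:<end>:<strand>`, as Ensembl peptide FASTA files generally do; the exact layout may differ between releases. The function name and the identifiers in the example header are made up.

```python
import re

# Location field assumed to look like: chromosome:Zv9:9:1000:5000:-1
LOCATION = re.compile(
    r"(?P<type>chromosome|scaffold):(?P<assembly>[^:\s]+):(?P<name>[^:\s]+):"
    r"(?P<start>\d+):(?P<end>\d+):(?P<strand>-?1)"
)


def parse_location(header):
    """Extract the genomic location from a FASTA header, or None if absent."""
    match = LOCATION.search(header)
    if match is None:
        return None
    return {
        "protein": header.lstrip(">").split()[0],
        "assembly": match.group("assembly"),
        "chromosome": match.group("name"),
        "start": int(match.group("start")),
        "end": int(match.group("end")),
        "strand": "+" if match.group("strand") == "1" else "-",
    }


# Hypothetical header: the IDs and coordinates are placeholders.
header = (
    ">ENSDARP00000000001 pep:known chromosome:Zv9:9:1000:5000:-1 "
    "gene:ENSDARG00000000001 transcript:ENSDART00000000001"
)

location = parse_location(header)
print(location)
```

Running this over the subject names reported by blastx gives a chromosome and position for each hit without any further lookup.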

Thanks for the suggestions!

0
8.5 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can download the genebank format file for those sequences and extract the feature information. For example one of the genes you listed: http://www.ncbi.nlm.nih.gov/protein/CAQ15172.1

If you scroll down in the genebank file, you can see that there is feature information, and that the source of this protein is on chromosome 9.

You can use BioPerl's efetch module to download the genebank files, and then use the SeqIO module to parse them and extract the feature information.
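
As an illustration of the parsing step only, here is a small sketch in plain Python rather than BioPerl (a swapped-in technique, not the modules named above). It reads the qualifiers of the `source` feature from flat-file text that has already been downloaded. The function name is hypothetical and the record excerpt is a made-up fragment in the flat-file feature table layout, not the actual record.

```python
import re

QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def source_qualifiers(flatfile_text):
    """Collect the qualifiers of the 'source' feature from flat-file text."""
    qualifiers = {}
    in_source = False
    for line in flatfile_text.splitlines():
        if len(line) > 5 and line[:5] == "     " and line[5] != " ":
            # A feature key starts in column 6; track whether it is 'source'.
            in_source = line.split()[0] == "source"
        elif not line.startswith(" "):
            # A new top-level section (or a blank line) ends the feature.
            in_source = False
        elif in_source:
            for key, value in QUALIFIER.findall(line):
                qualifiers[key] = value
    return qualifiers


# Made-up excerpt for illustration; lengths and product are placeholders.
example = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     source          1..120",
    '                     /organism="Danio rerio"',
    '                     /db_xref="taxon:7955"',
    '                     /chromosome="9"',
    "     Protein         1..120",
    '                     /product="hypothetical example protein"',
])

info = source_qualifiers(example)
print(info.get("chromosome"))
```

This recovers the chromosome name only. Feature locations in a protein record are given in protein coordinates.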


The problem with this approach is that the NCBI link in the answer is a GenPept file, not GenBank, which means the coordinates in the file are for the protein sequence, not the chromosome. Also note that it's GenBank, not genebank :)

ADD REPLYlink written 8.5 years ago by Neilfws48k