Question: How To Download Gene Sequences From NCBI Gene
Pierre wrote, 7.8 years ago:

Hello all,

I just can't figure out an easy way to download all the human gene sequences as defined by the NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene).

Any ideas? Why don't they just put a FASTA file on their FTP site?

Thanks,

Pierre.

ncbi gene download fasta • 24k views
Neilfws (Sydney, Australia) wrote, 7.8 years ago:

As others have pointed out: despite its name, the "gene" database is not the appropriate resource for retrieving the data that you want.

If you're looking for a FASTA file to download from the NCBI FTP site, why not start from the top level and explore it? I just did so and found:

ftp.ncbi.nih.gov                                # top level
ftp.ncbi.nih.gov/genomes                        # genomes - that looks useful!
ftp.ncbi.nih.gov/genomes/H_sapiens/             # even more useful
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/   # Aha!

File rna.fa.gz looks like the one.
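
Once a file like rna.fa.gz has been downloaded, it can be read record by record without unpacking it first. This is only a minimal sketch: the URL constant is assembled from the path quoted above and may no longer be valid, and the function names `read_fasta` and `read_gzipped_fasta` are made up for illustration.

```python
import gzip
from typing import Iterator, TextIO, Tuple

# Assumption: location pieced together from the FTP path quoted in this answer.
# NCBI may have moved or renamed the file since then.
RNA_FASTA_URL = "ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/rna.fa.gz"


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_gzipped_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Stream records from a gzip-compressed FASTA file on disk."""
    with gzip.open(path, "rt") as handle:
        yield from read_fasta(handle)
```

Fetching the file itself is left out on purpose; something like `urllib.request.urlretrieve(RNA_FASTA_URL, "rna.fa.gz")` should work if the path still exists, after which `read_gzipped_fasta("rna.fa.gz")` iterates over the transcripts.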

Michael Dondrup (Bergen, Norway) wrote, 7.8 years ago:

Ensembl maintains genome annotations; look here: http://www.ensembl.org/info/data/ftp/index.html

Because there are many ways to define a "gene" (CDS, transcript, exons only), there would be many different FASTA files. You can try Ensembl BioMart with the following query, which gives you the nucleotide sequence of protein-coding regions with the Ensembl gene ID as the header ID: ...biomart link


I don't want to use Ensembl. NCBI defines start and end coordinates on contigs/chromosomes in the Gene database; I want to find a way to retrieve the corresponding sequences, that's all.

Reply written 7.8 years ago by Pierre

How about clicking on "FASTA"/"GenBank" in the "Genomic regions, transcripts, and products" section then? ;-)

Reply written 7.8 years ago by Michael Schubert

There aren't enough of us on the team to do that ~42,000 times.

Reply written 7.8 years ago by Pierre

Then please update your question.

Reply written 7.8 years ago by Michael Schubert

OK, I overlooked the "all".

Reply written 7.8 years ago by Michael Schubert

Well, you will find that the coordinates are based on the Ensembl annotation. I showed you a way; if you don't want to use it, that's your problem.

Reply written 7.8 years ago by Michael Dondrup

Thank you all for your help. Anyway, they associate each gene symbol with coordinates (whose definition is arguable, and I'm OK with that, but it looks like some kind of RefSeq mRNA clustering plus human curation, and that's exactly what I'm looking for). Since they define this set of coordinates, I don't understand why they don't just make a FASTA file as for any other dataset. I managed to write a script that parses the annotation field of the summary file and downloads the sequences with eUtils. Thanks again for your help.

Reply written 7.8 years ago by Pierre
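
Pierre's script is not shown in the thread. The following is only a rough sketch of what such an approach might look like, not his actual code. It assumes the Gene summary file contains lines of the form `Annotation: Chromosome 1 NC_000001.10 (1000..2000, complement)` (the coordinates here are invented), and it only builds the E-utilities `efetch` URL rather than sending the request. The names `parse_annotation` and `efetch_url` are hypothetical.

```python
import re
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumption: the annotation field looks like
#   "Annotation: Chromosome 1 NC_000001.10 (1000..2000, complement)"
ANNOTATION_RE = re.compile(
    r"Annotation:.*?(?P<acc>[A-Z]{2}_\d+\.\d+)\s*"
    r"\((?P<start>\d+)\.\.(?P<stop>\d+)(?P<comp>,\s*complement)?\)"
)


def parse_annotation(line):
    """Return (accession, start, stop, strand) or None if nothing matches.

    strand follows the efetch convention: 1 = plus, 2 = minus.
    """
    m = ANNOTATION_RE.search(line)
    if m is None:
        return None
    strand = 2 if m["comp"] else 1
    return m["acc"], int(m["start"]), int(m["stop"]), strand


def efetch_url(acc, start, stop, strand):
    """Build an efetch URL that asks for one genomic interval as FASTA."""
    params = {
        "db": "nuccore",
        "id": acc,
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)
```

For ~42,000 genes the requests would need to be throttled to stay within NCBI's usage limits; that part is omitted here.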
Michael Kuhn (EMBL Heidelberg) wrote, 7.8 years ago:

You might have the wrong database. The NCBI Gene database doesn't seem to store any sequences; they are stored in the Nucleotide database instead. Depending on what you want to do, Ensembl may also be more useful.


Could I ask for your opinion, please? My aim is to download the longest canonical transcript of every protein-coding gene (i.e. coding sequences, not proteins) of Chlorocebus from NCBI. Using Ensembl is not an option (I'm very familiar with Ensembl, so I would have preferred to use it if possible).

I originally posted this question.

As you can see in the comments, GenoMax2 kindly suggested: 'Choose "Send to file" and then "tabular text view" to download full table. Cut the interval columns out for the locations and then use getfasta from bedtools to recover the DNA sequence.'

Then I ran into a problem with bedtools, and I asked a question here (my question is a comment underneath the original question). You can see in this post that (1) I am really trying to understand how to get the data I need using a variety of methods and (2) there is obviously something wrong.

Could you possibly explain, simply and specifically, the steps you would use to download the longest canonical transcripts for protein coding genes (i.e. coding sequences and not protein) for Chlorocebus_sabeus_1.1? I just don't understand what I'm doing wrong, but I know NCBI is a well used database, and I've gone through the help pages and emailed NCBI (who suggested I use Entrez Direct, which was the reason that I originally asked for help in BioStars). Thanks.

Many thanks.

Reply written 3.2 years ago by Tom
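
The "cut the interval columns out" step quoted above might be sketched as follows. This is an illustration under assumptions, not a tested recipe: the column names are what the NCBI Gene tabular download is assumed to use and should be checked against the header of the actual file, and `gene_table_to_bed` is a made-up helper name. The conversion matters because BED starts are 0-based while the Gene table is assumed to report 1-based, inclusive coordinates.

```python
import csv
import io

# Assumed column names of the NCBI Gene "tabular text" download;
# verify them against the header line of your own file.
ACC_COL = "genomic_nucleotide_accession.version"
START_COL = "start_position_on_the_genomic_accession"
END_COL = "end_position_on_the_genomic_accession"
STRAND_COL = "orientation"
NAME_COL = "Symbol"

STRANDS = {"plus": "+", "minus": "-"}


def gene_table_to_bed(tsv_text):
    """Convert tab-separated gene rows into 6-column BED lines.

    Rows without an accession or numeric coordinates are skipped.
    """
    bed_lines = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        acc = (row.get(ACC_COL) or "").strip()
        start = (row.get(START_COL) or "").strip()
        end = (row.get(END_COL) or "").strip()
        if not (acc and start.isdigit() and end.isdigit()):
            continue
        strand = STRANDS.get((row.get(STRAND_COL) or "").strip(), ".")
        name = (row.get(NAME_COL) or "").strip() or "."
        # 1-based inclusive start -> 0-based BED start; the end stays as is.
        bed_lines.append("\t".join([acc, str(int(start) - 1), end, name, "0", strand]))
    return bed_lines
```

The resulting lines can be written to a file and passed to something like `bedtools getfasta -fi genome.fa -bed genes.bed -s -name -fo genes.fa`, provided the sequence names in the genome FASTA match the accessions in the first BED column. Note that this recovers the whole genomic span of each gene, introns included, not the coding sequence of a particular transcript.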
Dror (Israel) wrote, 7.8 years ago:

What you are looking for is actually the RefSeq datasets. All the Entrez Gene data is derived directly from the RefSeq database. So, depending on whether you are looking for genomic, CDS or protein data, you should look for the relevant RefSeq data: http://www.ncbi.nlm.nih.gov/RefSeq/key.html. You can go to the relevant database interface, for example "protein", and limit the query to the relevant RefSeq data. Then download everything in FASTA format from the upper left menu.
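
The same "limit the query to RefSeq" idea can also be expressed as an E-utilities search instead of clicking through the web interface. The sketch below only builds the `esearch` URL; it assumes that the `srcdb_refseq[PROP]` filter and the `[Organism]` field behave as in the web search box, and `refseq_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(db, organism):
    """Build an esearch URL restricted to RefSeq records for one organism."""
    # Assumption: srcdb_refseq[PROP] limits results to RefSeq entries.
    term = f'"{organism}"[Organism] AND srcdb_refseq[PROP]'
    params = {"db": db, "term": term, "usehistory": "y"}
    return ESEARCH + "?" + urlencode(params)
```

For example, `refseq_search_url("protein", "Homo sapiens")` gives a URL whose result (a WebEnv and query key, because of `usehistory=y`) could then be handed to `efetch` with `rettype=fasta` to download the records in batches.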
