4
1
10.0 years ago
Pierre ▴ 20

Hello all,

I just can't figure out an easy way to download all the gene sequences of the human genome as defined by the NCBI Gene database (http://www.ncbi.nlm.nih.gov/gene).

Any ideas? Why don't they just put a FASTA file on their FTP site?

Thanks,

Pierre.

4
10.0 years ago
Neilfws 49k

As others have pointed out: despite its name, the "gene" database is not the appropriate resource for retrieving the data that you want.

If you're looking for a FASTA file to download from the NCBI FTP site, why not start from the top level and explore it? I just did so and found:

ftp.ncbi.nih.gov                                # top level
ftp.ncbi.nih.gov/genomes                        # genomes - that looks useful!
ftp.ncbi.nih.gov/genomes/H_sapiens/             # even more useful
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/RNA/   # Aha!


The file rna.fa.gz looks like the one.
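
Once a file like rna.fa.gz has been downloaded, it can be read without decompressing it first. The following is only a minimal sketch using the Python standard library; the function name `read_fasta_gz` is made up for illustration, and the local file name is assumed to be whatever you saved the download as.

```python
import gzip
from typing import Iterator, Tuple


def read_fasta_gz(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a gzip-compressed FASTA file."""
    header = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if header is not None:
                    yield header, "".join(chunks)
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
    # Emit the last record in the file.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Assumes rna.fa.gz has already been downloaded to the working directory.
    for name, seq in read_fasta_gz("rna.fa.gz"):
        print(name, len(seq))
```

Because the records are yielded one at a time, the whole file never has to fit in memory.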

2
10.0 years ago

Ensembl maintains the genome annotations; look here: http://www.ensembl.org/info/data/ftp/index.html

Because there are many ways to define a "gene" (CDS, transcript, exons only), there would be many different kinds of FASTA file. You can try Ensembl BioMart with the following query, which gives you the nucleotide sequence of protein-coding regions with the Ensembl gene ID as the header: ...biomart link

1

I don't want to use Ensembl. NCBI defines start and end coordinates on contigs/chromosomes in the Gene database; I just want a way to retrieve the corresponding sequences, that's all.

0

How about clicking on "FASTA"/"GenBank" in the "Genomic regions, transcripts, and products" section then? ;-)

0

There aren't enough of us on the team to do that ~42,000 times.

0

Well, you will find that the coordinates are based on the Ensembl annotation. I showed you a way; if you don't want to use it, that's your problem.

0

Thank you all for your help. Anyway, they associate each gene symbol with coordinates (the definitions are arguable, and I'm OK with that, but it looks like some kind of RefSeq mRNA clustering plus human curation, and that's exactly what I'm looking for). Since they define this set of coordinates, I don't understand why they don't just provide a FASTA file as they do for any other dataset. I managed to write a script that parses the annotation field of the summary file and downloads the sequences with eUtils. Thanks again for your help.
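
For anyone attempting the same thing, here is a rough sketch of that approach. It is not the script mentioned above, just one way it could look. It assumes the location text has the shape `NC_000001.11 (1000..2000, complement)` (the coordinates here are invented), and the names `parse_location` and `efetch_url` are made up for illustration. The `seq_start`, `seq_stop` and `strand` parameters are the E-utilities efetch options for fetching a sub-range of a nucleotide record.

```python
import re
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Assumed shape of the location text, e.g.
# "NC_000001.11 (1000..2000, complement)"
LOCATION = re.compile(
    r"(?P<acc>[A-Z]{2}_\d+\.\d+)\s*"
    r"\((?P<start>\d+)\.\.(?P<stop>\d+)(?P<comp>,\s*complement)?\)"
)


def parse_location(text):
    """Return (accession, start, stop, strand) or None if nothing matches."""
    match = LOCATION.search(text)
    if match is None:
        return None
    # efetch uses strand=1 for the plus strand and strand=2 for the minus strand.
    strand = 2 if match.group("comp") else 1
    return (
        match.group("acc"),
        int(match.group("start")),
        int(match.group("stop")),
        strand,
    )


def efetch_url(acc, start, stop, strand):
    """Build an efetch URL that returns the sub-sequence as FASTA text."""
    params = {
        "db": "nuccore",
        "id": acc,
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Invented example line, not a real gene record.
    loc = parse_location("Chromosome 1 NC_000001.11 (1000..2000, complement)")
    print(efetch_url(*loc))
```

With ~42,000 genes, the requests should be throttled to stay within NCBI's usage limits.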

1
10.0 years ago

You might have the wrong database. The NCBI Gene database doesn't seem to store any sequences; they are stored in the Nucleotide database instead. Depending on what you want to do, Ensembl may also be more useful.

0

Could I ask for your opinion, please? My aim is to download all of the longest canonical transcripts for the protein-coding genes (i.e. coding sequences, not proteins) of Chlorocebus from NCBI. Using Ensembl is not an option (I'm very familiar with Ensembl, so I would have preferred to use it if possible).

I originally posted this question.

As you can see in the comments, GenoMax2 kindly suggested: "Choose 'Send to file' and then 'tabular text view' to download full table. Cut the interval columns out for the locations and then use getfasta from bedtools to recover the DNA sequence."
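
The "cut the interval columns out" step in that suggestion can be sketched roughly as follows. This is only an illustration under assumptions: the column names are what I believe the NCBI Gene tabular download uses and should be checked against the header of your own file, the function name `table_to_bed` is made up, and whether the coordinates are 1-based is also an assumption (hence the `one_based` switch).

```python
import csv
import io

# Column names assumed from the NCBI Gene "tabular text" download;
# check them against the header line of your own file.
ACC = "genomic_nucleotide_accession.version"
START = "start_position_on_the_genomic_accession"
END = "end_position_on_the_genomic_accession"
ORIENT = "orientation"
NAME = "Symbol"


def table_to_bed(handle, one_based=True):
    """Yield BED6 lines (0-based, half-open) from a tab-separated gene table."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Skip genes that have no genomic placement.
        if not row.get(ACC) or not row.get(START) or not row.get(END):
            continue
        start = int(row[START])
        end = int(row[END])
        if one_based:
            start -= 1  # 1-based inclusive -> 0-based half-open
        else:
            end += 1    # 0-based inclusive -> 0-based half-open
        strand = "-" if row.get(ORIENT) == "minus" else "+"
        name = row.get(NAME) or "."
        yield "\t".join([row[ACC], str(start), str(end), name, "0", strand])


if __name__ == "__main__":
    # Assumes the downloaded table was saved as gene_result.txt.
    with open("gene_result.txt", newline="") as src, open("genes.bed", "w") as out:
        for bed_line in table_to_bed(src):
            out.write(bed_line + "\n")
```

The resulting file can then be passed to something like `bedtools getfasta -fi genome.fa -bed genes.bed -s -name -fo genes.fa`. Note that the first BED column must match the sequence names in the FASTA headers exactly, otherwise getfasta will not find the regions.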

Then I ran into a problem with bedtools and asked a question here (my question is a comment underneath the original question). You can see in that post that (1) I am really trying to understand how to get the data I need using a variety of methods, and (2) there is obviously something wrong.

Could you possibly explain, simply and specifically, the steps you would use to download the longest canonical transcripts for protein-coding genes (i.e. coding sequences, not proteins) for Chlorocebus_sabeus_1.1? I just don't understand what I'm doing wrong, but I know NCBI is a widely used database, and I've gone through the help pages and emailed NCBI (who suggested I use Entrez Direct, which was the reason I originally asked for help on BioStars).

Many thanks.

0
10.0 years ago
Dror ▴ 280

What you are looking for is actually the RefSeq datasets. All the Entrez Gene data is derived directly from the RefSeq database. So, depending on whether you are looking for genomic, CDS or protein data, you should look for the relevant RefSeq data: http://www.ncbi.nlm.nih.gov/RefSeq/key.html. You can go to the relevant database interface, for example "protein", and limit the query to the relevant RefSeq data. Then download everything in FASTA format from the upper left menu.
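
The same kind of query can also be built programmatically rather than through the web interface. The snippet below is just a sketch: it only constructs an E-utilities esearch URL, the function name `refseq_search_url` is made up, and the search term (an organism restricted with `srcdb_refseq[PROP]`) is my assumption about what "limit the query to the relevant RefSeq data" would look like as an Entrez query.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(db, organism):
    """Build an esearch URL for all RefSeq records of one organism."""
    # srcdb_refseq[PROP] restricts the results to RefSeq records.
    term = f'"{organism}"[Organism] AND srcdb_refseq[PROP]'
    params = {
        "db": db,           # e.g. "protein" or "nuccore"
        "term": term,
        "usehistory": "y",  # keep the result set on the NCBI history server
        "retmax": 0,        # only the count and history keys are needed here
    }
    return ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(refseq_search_url("protein", "Homo sapiens"))
```

The response contains a `WebEnv` and a `QueryKey`, which can be passed to efetch with `rettype=fasta` to download the matching records in batches.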