Download The List Of All Ensembl Genes And Difference Between Refseq And Ensembl Genes
12.3 years ago
Bioscientist ★ 1.7k

This may sound similar to my previous questions, but I am confused and have many more questions.

  1. How can I download all genes from the Ensembl/BioMart websites (or should I use MySQL)?

  2. I can also download Ensembl genes from UCSC, so what's the difference between Ensembl genes from BioMart and from UCSC?

  3. I compared RefSeq genes and Ensembl genes (both from the UCSC Genome Browser). The coordinates can be slightly different (maybe a few hundred bp apart). Will this lead to confusion? Basically, I want to find the overlapping regions between the identified structural variants and genes. I'm using g1k_37v as the reference for alignments. So which gene list should I use, RefSeq or Ensembl?

thx

ensembl biomart ucsc • 11k views

When you say all genes, which organism do you mean? Or do you mean all organisms?

12.3 years ago

1) Yes, you can use MySQL:

$ mysql -h ensembldb.ensembl.org -u anonymous -P 3306 -D homo_sapiens_core_47_36i  -e 'desc gene' 
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| Field             | Type                                                                         | Null | Key | Default | Extra          |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| gene_id           | int(10) unsigned                                                             |      | PRI | NULL    | auto_increment |
| biotype           | varchar(40)                                                                  |      |     |         |                |
| analysis_id       | smallint(5) unsigned                                                         |      | MUL | 0       |                |
| seq_region_id     | int(10) unsigned                                                             |      | MUL | 0       |                |
| seq_region_start  | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_end    | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_strand | tinyint(2)                                                                   |      |     | 0       |                |
| display_xref_id   | int(10) unsigned                                                             | YES  | MUL | NULL    |                |
| source            | varchar(20)                                                                  |      |     |         |                |
| status            | enum('KNOWN','NOVEL','PUTATIVE','PREDICTED','KNOWN_BY_PROJECTION','UNKNOWN') | YES  |     | NULL    |                |
| description       | text                                                                         | YES  |     | NULL    |                |
| is_current        | tinyint(1)                                                                   |      |     | 1       |                |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
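The `desc gene` output above only lists the columns. As a minimal sketch of how one might actually pull gene coordinates, the Python below builds a query and the matching `mysql` command line. It assumes the core schema's `seq_region` table (not shown above) maps `seq_region_id` to a chromosome name; the helper names `build_gene_query` and `build_mysql_argv` are made up for this illustration.

```python
import shlex

# Connection details copied from the command above. The database name pins
# an old release, so swap in the release/assembly you actually need.
HOST = "ensembldb.ensembl.org"
USER = "anonymous"
PORT = 3306
DATABASE = "homo_sapiens_core_47_36i"


def build_gene_query(limit=None):
    """Return SQL listing gene coordinates together with the sequence name.

    Assumption: the core schema has a seq_region table with columns
    seq_region_id and name. The gene columns all appear in 'desc gene'.
    """
    sql = (
        "SELECT g.gene_id, sr.name, g.seq_region_start, g.seq_region_end, "
        "g.seq_region_strand, g.biotype "
        "FROM gene g JOIN seq_region sr USING (seq_region_id)"
    )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def build_mysql_argv(sql):
    """Return the argument list for the mysql client (-B gives tab-separated output)."""
    return ["mysql", "-h", HOST, "-u", USER, "-P", str(PORT),
            "-D", DATABASE, "-B", "-e", sql]


if __name__ == "__main__":
    # Print a copy-pastable shell command rather than connecting from here.
    print(shlex.join(build_mysql_argv(build_gene_query(limit=10))))
```

Dropping the `limit` argument returns every row of the `gene` table; redirect the `mysql` output to a file to keep it.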

2) yes:

3) RefSeq:

Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration.

is not the same as UCSC genes:

Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from Genbank.

There are many programs available to find the overlap between a gene and a variation. For example, as far as I remember, SnpEff offers both options: working with knownGenes or with Ensembl.


Thanks Pierre, but what's the difference between the two mysql commands?


The mysql commands only give me the output you showed; how do I retrieve the genes I want? Thanks.


See an example here.

12.3 years ago

Hi,

1) You can download all Ensembl genes directly from our ftp site:

http://www.ensembl.org/info/data/ftp/index.html

in addition to the Perl API or MySQL, as discussed. BioMart is better for downloading sets of genes, but it tends to time out on large queries such as a whole-genome query. This video tutorial shows how to use BioMart:

http://www.ensembl.org/Help/Movie?id=189
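
For the FTP route, here is a rough sketch of turning a downloaded annotation file into one interval per gene. It assumes you fetched a GTF file (tab-separated, 1-based inclusive coordinates, with a `gene_id "..."` attribute in column 9) and want BED-style output. The function names are invented for this example and it is not an Ensembl tool.

```python
import re

# Matches the gene_id attribute in the ninth GTF column.
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gtf_to_gene_spans(lines):
    """Collapse GTF feature lines into one (chrom, start, end, strand) per gene_id.

    The span runs from the smallest start to the largest end seen for the
    gene, so it works whether or not the file has explicit 'gene' lines.
    """
    spans = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if gene_id in spans:
            _, old_start, old_end, _ = spans[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene_id] = (chrom, start, end, strand)
    return spans


def spans_to_bed(spans):
    """Format spans as BED lines: 0-based start, end-exclusive."""
    return [
        f"{chrom}\t{start - 1}\t{end}\t{gene_id}\t0\t{strand}"
        for gene_id, (chrom, start, end, strand) in spans.items()
    ]
```

Typical use would be `spans_to_bed(gtf_to_gene_spans(open(path)))`, where `path` points at the uncompressed GTF you downloaded.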

2) The Ensembl genes you get from Ensembl, BioMart, and UCSC should be the same. However, UCSC does not update the Ensembl gene set as frequently as we do, so you're better off querying Ensembl directly for the latest updates.

3) Ensembl genes are based partly on RefSeq, partly on UniProt, and partly on Havana manual annotation. So they may not exactly match RefSeq genes, but the genebuild analysis does incorporate RefSeq data. For more on the genebuild, have a look here:

http://www.ensembl.org/info/docs/genebuild/index.html

Or feel free to email us at helpdesk@ensembl.org with these types of questions, or for more information.

Hope that helps!

12.3 years ago
Pascal ★ 1.5k

I don't know if it helps, but when you have the genes and structural variants as BED files, you can use BEDTools (the intersectBed command, for instance) to find intersections.
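
To make clear what such an intersection computes, here is a small pure-Python sketch of the overlap test on BED-style records. It is not how BEDTools is implemented, just a brute-force illustration; the records and the names `overlaps` and `intersect` are made up for the example.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open BED intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect(variants, genes):
    """Return (variant_name, gene_name) for every overlapping pair.

    Each record is (chrom, start, end, name) in BED coordinates.
    Chromosome names are compared as plain strings, so "1" and "chr1"
    would not match. Brute force: fine for a sketch, slow genome-wide.
    """
    hits = []
    for v_chrom, v_start, v_end, v_name in variants:
        for g_chrom, g_start, g_end, g_name in genes:
            if v_chrom == g_chrom and overlaps(v_start, v_end, g_start, g_end):
                hits.append((v_name, g_name))
    return hits


# Made-up records for illustration only.
variants = [("1", 150, 400, "sv1"), ("2", 10, 20, "sv2")]
genes = [("1", 99, 450, "geneA"), ("1", 400, 500, "geneB"), ("2", 49, 80, "geneC")]
print(intersect(variants, genes))  # [('sv1', 'geneA')]
```

Here `sv1` ends at 400 and `geneB` starts at 400, so with half-open intervals they touch but do not overlap.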
