Question

Download The List Of All Ensembl Genes And Difference Between Refseq And Ensembl Genes

3


12.3 years ago

Bioscientist ★ 1.7k

This may sound similar to my previous questions, but I am confused and have many more questions.

1) How can I download ALL genes from the Ensembl/BioMart websites (or should I use MySQL)?
2) I can also download Ensembl genes from UCSC, so what is the difference between the Ensembl genes from BioMart and those from UCSC?
3) I compared RefSeq genes and Ensembl genes (both from the UCSC genome browser). The coordinates can differ slightly (maybe by a few hundred bp). Will this lead to confusion? Basically, I want to find the overlapping regions between the identified structural variants and genes. I'm using g1k_37v as the reference for alignments. So which gene list should I use: RefSeq or Ensembl?

thx

ensembl biomart ucsc • 11k views

updated 12.3 years ago by Giulietta - Ensembl Helpdesk ★ 1.2k • written 12.3 years ago by Bioscientist ★ 1.7k

0


When you say all genes, which organism do you mean? Or do you mean all organisms?

12.3 years ago by Steve Moss 2.3k

Ram · Answer 1 · 2012-01-18

1) Yes, you can use MySQL:

$ mysql -h ensembldb.ensembl.org -u anonymous -P 3306 -D homo_sapiens_core_47_36i  -e 'desc gene' 
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| Field             | Type                                                                         | Null | Key | Default | Extra          |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| gene_id           | int(10) unsigned                                                             |      | PRI | NULL    | auto_increment |
| biotype           | varchar(40)                                                                  |      |     |         |                |
| analysis_id       | smallint(5) unsigned                                                         |      | MUL | 0       |                |
| seq_region_id     | int(10) unsigned                                                             |      | MUL | 0       |                |
| seq_region_start  | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_end    | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_strand | tinyint(2)                                                                   |      |     | 0       |                |
| display_xref_id   | int(10) unsigned                                                             | YES  | MUL | NULL    |                |
| source            | varchar(20)                                                                  |      |     |         |                |
| status            | enum('KNOWN','NOVEL','PUTATIVE','PREDICTED','KNOWN_BY_PROJECTION','UNKNOWN') | YES  |     | NULL    |                |
| description       | text                                                                         | YES  |     | NULL    |                |
| is_current        | tinyint(1)                                                                   |      |     | 1       |                |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
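To go from the table description to an actual gene list, you could select the coordinate columns shown above. Below is a minimal Python sketch that only builds the query and the `mysql` command line; it does not connect to anything. The host, user, port and database are copied from the command above. The join against a `seq_region` table with a `name` column is an assumption about the core schema that is not shown in this output, so check it with `desc seq_region` first.

```python
import shlex

# Connection details copied from the command in this answer.
HOST = "ensembldb.ensembl.org"
USER = "anonymous"
PORT = 3306
DATABASE = "homo_sapiens_core_47_36i"


def build_gene_query(limit=None):
    """Return SQL listing current genes with their coordinates.

    Assumption: the core schema has a `seq_region` table whose `name`
    column holds the chromosome/contig name for each `seq_region_id`.
    """
    sql = (
        "SELECT sr.name, g.seq_region_start, g.seq_region_end, "
        "g.seq_region_strand, g.gene_id, g.biotype "
        "FROM gene g JOIN seq_region sr USING (seq_region_id) "
        "WHERE g.is_current = 1"
    )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


def build_mysql_command(sql):
    """Return the argv list for a batch-mode mysql call (tab-separated output)."""
    return [
        "mysql", "-h", HOST, "-u", USER, "-P", str(PORT),
        "-D", DATABASE, "--batch", "-e", sql,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing touches the network.
    print(shlex.join(build_mysql_command(build_gene_query(limit=10))))
```

Drop the `limit` argument once the first few rows look right. Ensembl coordinates are 1-based and inclusive, which matters if you later convert the result to BED.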

2) yes:

3) RefSeq:

Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration.

is not the same as UCSC Known Genes:

Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from Genbank.

There are many programs available to find the overlap between a gene and a variation. For example, as far as I remember, SnpEff supports both options: working with knownGene or with Ensembl.

score 3 · Answer 2 · 2012-01-20

Hi,

1) You can download all Ensembl genes directly from our ftp site:

http://www.ensembl.org/info/data/ftp/index.html

in addition to the Perl API or MySQL, as discussed. BioMart is better for downloading sets of genes, but it tends to time out on large queries such as a whole-genome query. This video tutorial shows how to use BioMart:

http://www.ensembl.org/Help/Movie?id=189
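If you take the FTP route, the gene annotation typically comes as a GTF file. The sketch below shows one way to collapse GTF feature lines into a single span per gene in Python. It assumes the standard nine tab-separated GTF columns and a `gene_id "..."` attribute; `Gene` and `genes_from_gtf` are names made up for this example, and the sample records are invented, not real annotations.

```python
import re
from collections import namedtuple

Gene = namedtuple("Gene", "chrom start end strand gene_id")

_GENE_ID = re.compile(r'gene_id "([^"]+)"')


def genes_from_gtf(lines):
    """Collapse GTF feature lines into one record per gene_id.

    GTF coordinates are 1-based and inclusive; they are kept unchanged here.
    """
    spans = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        match = _GENE_ID.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if gene_id in spans:
            # Widen the gene span to cover this feature as well.
            old = spans[gene_id]
            spans[gene_id] = old._replace(
                start=min(old.start, start), end=max(old.end, end)
            )
        else:
            spans[gene_id] = Gene(chrom, start, end, strand, gene_id)
    return sorted(spans.values(), key=lambda g: (g.chrom, g.start))


if __name__ == "__main__":
    # Invented records for illustration only.
    example = [
        '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
        '1\tprotein_coding\texon\t400\t500\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    ]
    for gene in genes_from_gtf(example):
        print(*gene, sep="\t")
```

For a real download you would pass an open file handle instead of the list, for example `gzip.open(path, "rt")` for a `.gtf.gz` file. Taking the minimum start and maximum end over all features means this works whether or not the file contains explicit gene lines.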

2) The Ensembl genes you get from Ensembl, BioMart, and UCSC should be the same. However, UCSC does not update the Ensembl gene set as frequently as we do, so you're better off querying Ensembl directly for the latest updates.

3) Ensembl genes are based partly on RefSeq, partly on UniProt, and partly on Havana manual annotation. So they may not exactly match RefSeq genes, but the RefSeq data is incorporated into the genebuild analysis. For more on the genebuild, have a look here:

http://www.ensembl.org/info/docs/genebuild/index.html
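To see how far the two gene sets actually disagree for the genes you care about, you could compare their coordinates directly. This is a rough Python sketch, assuming you have already loaded both sets into dictionaries keyed by gene symbol; `coordinate_differences` is a hypothetical helper and the values in the example are invented.

```python
def coordinate_differences(refseq, ensembl):
    """Compare two {symbol: (chrom, start, end)} mappings.

    Returns {symbol: (start_delta, end_delta)} for symbols present in both
    sets on the same chromosome, where delta = Ensembl minus RefSeq.
    """
    diffs = {}
    for symbol, (chrom, start, end) in refseq.items():
        other = ensembl.get(symbol)
        if other is None or other[0] != chrom:
            continue  # gene missing from Ensembl set, or placed elsewhere
        diffs[symbol] = (other[1] - start, other[2] - end)
    return diffs


if __name__ == "__main__":
    # Invented coordinates for illustration only.
    refseq = {"GENE_A": ("1", 1000, 5000), "GENE_B": ("2", 300, 900)}
    ensembl = {"GENE_A": ("1", 850, 5200), "GENE_C": ("3", 10, 20)}
    for symbol, (d_start, d_end) in coordinate_differences(refseq, ensembl).items():
        print(symbol, d_start, d_end)
```

Both inputs need to use the same coordinate convention and the same chromosome naming for the deltas to mean anything.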

Or feel free to email us at helpdesk@ensembl.org with these types of questions, or for more information.

Hope that helps!

score 2 · Answer 3 · 2012-01-18


12.3 years ago

Pascal ★ 1.5k

I don't know if it helps, but when you have the genes and structural variants as BED files, you can use BEDtools (the intersectBed command, for instance) to find intersections.
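The idea behind such an intersection can be sketched in a few lines of Python. This is not how BEDtools is implemented, just a naive illustration of the overlap test, assuming both inputs are `(chrom, start, end, name)` tuples in BED coordinates (0-based start, exclusive end). The function names and the sample data are made up.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def intersect(variants, genes):
    """Yield (variant_name, gene_name, overlap_bp) for every overlapping pair.

    Both inputs are iterables of (chrom, start, end, name) in BED
    coordinates (0-based start, exclusive end).
    """
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        genes_by_chrom[chrom].append((start, end, name))
    for chrom, v_start, v_end, v_name in variants:
        # Naive scan over all genes on the same chromosome.
        for g_start, g_end, g_name in genes_by_chrom.get(chrom, []):
            if overlaps(v_start, v_end, g_start, g_end):
                yield v_name, g_name, min(v_end, g_end) - max(v_start, g_start)


if __name__ == "__main__":
    # Invented intervals for illustration only.
    genes = [("1", 100, 500, "GENE_A"), ("2", 0, 50, "GENE_B")]
    variants = [("1", 450, 600, "SV1"), ("1", 500, 700, "SV2")]
    for hit in intersect(variants, genes):
        print(*hit, sep="\t")
```

Chromosome names have to match between the two files: UCSC downloads use names like `chr1`, while the g1k v37 reference uses `1`, so one side may need renaming before any overlap is found. For whole-genome gene sets, BEDtools itself will be much faster than this pairwise scan.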
