Question: Download The List Of All Ensembl Genes And Difference Between Refseq And Ensembl Genes
3
Bioscientist (1.6k) wrote, 7.2 years ago:

This may sound similar to my previous questions, but I am confused and have many more questions.

  1. How can I download ALL genes from the Ensembl/BioMart websites (or should I use MySQL)?

  2. I can also download Ensembl genes from UCSC, so what's the difference between Ensembl genes from BioMart and from UCSC?

  3. I compared RefSeq genes and Ensembl genes (both from the UCSC Genome Browser). The coordinates can be slightly different (maybe a few hundred bp apart). Will this lead to confusion? Basically, I want to find the overlapping regions between the identified structural variants and genes. I'm using g1k_37v as the reference for alignments. So which gene list should I use: RefSeq or Ensembl?

thx

ensembl biomart ucsc • 7.3k views
modified 7.2 years ago by Giulietta - Ensembl Helpdesk (1.2k) • written 7.2 years ago by Bioscientist (1.6k)

When you say all genes, which organism do you mean? Or do you mean all organisms?

Reply written 7.2 years ago by Steve Moss (2.2k)
4
Pierre Lindenbaum (118k; France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 7.2 years ago:

1) yes:

$ mysql -h ensembldb.ensembl.org -u anonymous -P 3306 -D homo_sapiens_core_47_36i  -e 'desc gene' 
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| Field             | Type                                                                         | Null | Key | Default | Extra          |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+
| gene_id           | int(10) unsigned                                                             |      | PRI | NULL    | auto_increment |
| biotype           | varchar(40)                                                                  |      |     |         |                |
| analysis_id       | smallint(5) unsigned                                                         |      | MUL | 0       |                |
| seq_region_id     | int(10) unsigned                                                             |      | MUL | 0       |                |
| seq_region_start  | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_end    | int(10) unsigned                                                             |      |     | 0       |                |
| seq_region_strand | tinyint(2)                                                                   |      |     | 0       |                |
| display_xref_id   | int(10) unsigned                                                             | YES  | MUL | NULL    |                |
| source            | varchar(20)                                                                  |      |     |         |                |
| status            | enum('KNOWN','NOVEL','PUTATIVE','PREDICTED','KNOWN_BY_PROJECTION','UNKNOWN') | YES  |     | NULL    |                |
| description       | text                                                                         | YES  |     | NULL    |                |
| is_current        | tinyint(1)                                                                   |      |     | 1       |                |
+-------------------+------------------------------------------------------------------------------+------+-----+---------+----------------+

2) yes:

$ mysql  --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'desc ensGene'
+--------------+------------------------------------+------+-----+---------+-------+
| Field        | Type                               | Null | Key | Default | Extra |
+--------------+------------------------------------+------+-----+---------+-------+
| bin          | smallint(5) unsigned               | NO   |     |         |       |
| name         | varchar(255)                       | NO   | MUL |         |       |
| chrom        | varchar(255)                       | NO   | MUL |         |       |
| strand       | char(1)                            | NO   |     |         |       |
| txStart      | int(10) unsigned                   | NO   |     |         |       |
| txEnd        | int(10) unsigned                   | NO   |     |         |       |
| cdsStart     | int(10) unsigned                   | NO   |     |         |       |
| cdsEnd       | int(10) unsigned                   | NO   |     |         |       |
| exonCount    | int(10) unsigned                   | NO   |     |         |       |
| exonStarts   | longblob                           | NO   |     |         |       |
| exonEnds     | longblob                           | NO   |     |         |       |
| score        | int(11)                            | YES  |     | NULL    |       |
| name2        | varchar(255)                       | NO   | MUL |         |       |
| cdsStartStat | enum('none','unk','incmpl','cmpl') | NO   |     |         |       |
| cdsEndStat   | enum('none','unk','incmpl','cmpl') | NO   |     |         |       |
| exonFrames   | longblob                           | NO   |     |         |       |
+--------------+------------------------------------+------+-----+---------+-------+

3) RefSeq:

"Reference Sequence (RefSeq) database is a collection of genomic, transcript and protein sequence records. These records are selected and curated from public sequence archives and represent a significant reduction in redundancy compared to the volume of data archived by the International Nucleotide Sequence Database Collaboration."

is not the same as UCSC Known Genes:

"Known Genes dataset is constructed by a fully automated process, based on protein data from Swiss-Prot/TrEMBL (UniProt) and the associated mRNA data from Genbank."

There are many programs available to find the overlap between a gene and a variation. For example, as far as I remember, SnpEff can work with either knownGenes or Ensembl.

written 7.2 years ago by Pierre Lindenbaum (118k)
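The two tables described above use different coordinate conventions, which matters before comparing either one against structural variants. UCSC database tables such as ensGene store 0-based, half-open intervals (txStart, txEnd), while the Ensembl core gene table stores 1-based, inclusive positions (seq_region_start, seq_region_end) and encodes strand as 1 or -1. The sketch below normalizes both to BED-style records. It assumes the Ensembl chromosome name has already been looked up through seq_region_id (that lookup is not shown in the thread); the function names and the Bed tuple are hypothetical. Since g1k-style references name chromosomes "1", "2", ... rather than "chr1", the UCSC helper can optionally strip the prefix.

```python
from typing import NamedTuple


class Bed(NamedTuple):
    """A BED-style record: 0-based start, exclusive end."""
    chrom: str
    start: int
    end: int
    name: str
    strand: str


def ucsc_to_bed(chrom, tx_start, tx_end, name, strand, strip_chr=False):
    """UCSC tables are already 0-based, half-open; only the name may change."""
    if strip_chr and chrom.startswith("chr"):
        chrom = chrom[3:]  # "chr1" -> "1" (hypothetical convenience option)
    return Bed(chrom, tx_start, tx_end, name, strand)


def ensembl_to_bed(region_name, seq_region_start, seq_region_end, name,
                   seq_region_strand):
    """Ensembl core rows are 1-based, inclusive; shift the start by one."""
    strand = {1: "+", -1: "-"}.get(seq_region_strand, ".")
    return Bed(region_name, seq_region_start - 1, seq_region_end, name, strand)
```

With both sources in the same convention, an off-by-one difference no longer shows up as a spurious mismatch. Differences of a few hundred bp between RefSeq and Ensembl models are a separate matter and reflect the annotations themselves.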

Thx Pierre, but what's the difference between the two mysql commands?

Reply written 7.2 years ago by Bioscientist (1.6k)

The mysql commands only give me the table descriptions you showed, but how do I retrieve the genes I want? thx

Reply written 7.2 years ago by Bioscientist (1.6k)

see an example here: http://biostar.stackexchange.com/questions/3121

Reply written 7.2 years ago by Pierre Lindenbaum (118k)
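To go from describing a table to actually retrieving rows, the same mysql client can be given a SELECT instead of `desc`. The following is a minimal sketch, not the method from the linked example: it reuses the host, user, and database from the UCSC command above and selects only columns listed in the ensGene description. The helper names are hypothetical, and `-B -N` are the standard mysql client options for tab-separated output without a header line.

```python
import subprocess


def ens_gene_query(chrom=None):
    """Build a SELECT over ensGene, optionally restricted to one chromosome."""
    sql = "SELECT name, name2, chrom, strand, txStart, txEnd FROM ensGene"
    if chrom is not None:
        # Reject anything that is not a plain sequence name such as "chr1".
        if not chrom.replace("_", "").isalnum():
            raise ValueError("unexpected chromosome name: %r" % chrom)
        sql += " WHERE chrom = '%s'" % chrom
    return sql


def mysql_command(sql):
    """Argument list for the public UCSC server used earlier in the thread."""
    return ["mysql", "--user=genome", "--host=genome-mysql.cse.ucsc.edu",
            "-A", "-D", "hg19", "-B", "-N", "-e", sql]


def fetch_rows(sql):
    """Run the query (needs network access) and split the tab-separated rows."""
    result = subprocess.run(mysql_command(sql), check=True,
                            capture_output=True, text=True)
    return [line.split("\t") for line in result.stdout.splitlines()]
```

Calling `fetch_rows(ens_gene_query("chr1"))` would return one row per transcript: in ensGene, `name` holds the transcript ID and `name2` the gene ID, so a gene with several transcripts appears several times.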
2
Pascal (1.4k; Barcelona) wrote, 7.2 years ago:

I don't know if it helps, but when you have the genes and structural variants as BED files, you can use BEDtools (the intersectBed command, for instance) to find intersections.

written 7.2 years ago by Pascal (1.4k)
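For readers who want to see what such an intersection computes, here is a small pure-Python sketch. It is not how BEDtools is implemented and it will be far slower on real data; it only illustrates the overlap test on BED-style (0-based, half-open) intervals. The function name and the tuple layout `(chrom, start, end, name)` are assumptions made for the example.

```python
from collections import defaultdict


def find_overlaps(genes, variants):
    """Return (variant_name, gene_name) pairs whose intervals overlap.

    Both inputs are iterables of (chrom, start, end, name) tuples with
    0-based, half-open coordinates, as in BED files.
    """
    genes_by_chrom = defaultdict(list)
    for chrom, start, end, name in genes:
        genes_by_chrom[chrom].append((start, end, name))

    hits = []
    for chrom, v_start, v_end, v_name in variants:
        for g_start, g_end, g_name in genes_by_chrom.get(chrom, ()):
            # Half-open intervals overlap when each starts before the other ends.
            if g_start < v_end and v_start < g_end:
                hits.append((v_name, g_name))
    return hits
```

Intervals that merely touch (one ends where the other starts) are not reported, which matches the half-open BED convention.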
2
Giulietta - Ensembl Helpdesk (Cambridge, UK) wrote, 7.2 years ago:

Hi,

1) You can download all Ensembl genes directly from our FTP site:

http://www.ensembl.org/info/data/ftp/index.html

in addition to the Perl API or MySQL, as discussed. BioMart is better for downloading sets of genes, but tends to time out when handling large queries such as a whole-genome query. Find out how to use BioMart from this video tutorial:

http://www.ensembl.org/Help/Movie?id=189

2) The Ensembl genes you get from Ensembl, BioMart, and UCSC should be the same; however, UCSC does not update the Ensembl gene set as frequently as we do, so you're better off querying Ensembl directly for the latest updates.

3) Ensembl genes are based partly on RefSeq, partly on UniProt, and partly on Havana manual annotation. So they may not exactly match RefSeq genes, but RefSeq data has been incorporated into the genebuild analysis. For more on the genebuild, have a look here:

http://www.ensembl.org/info/docs/genebuild/index.html

Or feel free to email us at helpdesk@ensembl.org with these types of questions, or for more information.

Hope that helps!

written 7.2 years ago by Giulietta - Ensembl Helpdesk (1.2k)
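One of the formats offered on the FTP site is GTF, a nine-column tab-separated file with 1-based, inclusive coordinates and a `gene_id "..."` entry in the attributes column. As a rough sketch of how a whole-genome gene list could be derived from such a file, the function below collapses exon lines into one span per gene. It assumes the file contains `exon` features carrying a `gene_id` attribute; the function name is hypothetical and the file name and download step are left out.

```python
import re

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Map gene_id -> (chrom, start, end, strand) from GTF exon lines.

    Coordinates are kept as in the GTF file: 1-based and inclusive.
    """
    spans = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        if gene_id in spans:
            # Widen the span to cover every exon seen so far for this gene.
            _, old_start, old_end, _ = spans[gene_id]
            start, end = min(start, old_start), max(end, old_end)
        spans[gene_id] = (chrom, start, end, strand)
    return spans
```

Passing an open file handle works as well as a list of lines, for example `gene_spans(open(path))` for a decompressed GTF file at a path of your choosing.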