I need to download a list of all human genes with their respective Ensembl gene name | transcription start site ..
10.2 years ago

Hello,

I'm new here and I need help, please!

I need to download a list of all human genes with their respective gene symbol | chromosome | strand | transcription start site (txStart) | transcription end site (txEnd) | and Ensembl gene name.

Currently I'm using the UCSC Table Browser and I get something like this:

#hg38.knownGene.name  hg38.knownGene.chrom  hg38.knownGene.strand  hg38.knownGene.txStart  hg38.knownGene.txEnd  hg38.kgXref.geneSymbol
uc001aaa.3            chr1                  +                      11873                   14409                 DDX11L1

I would like to know whether it's possible to include the Ensembl gene name in the UCSC table, or to get it with another method.

Thank you in advance

Cherif

ChIP-Seq gene • 13k views

Hi! Have you tried BioMart?


I'm about to try it (so far I don't know how to proceed :/)


Thank you for your help!

10.2 years ago

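Unfortunately, the UCSC Table Browser doesn't have an Ensembl Genes track for hg38 yet. It does have one for hg19, though, and the command below should work for you. It joins ensGene to ensemblToGeneName on the transcript ID, so each row carries the transcript ID, the Ensembl gene ID (name2), chromosome, strand, txStart, txEnd, and the gene symbol (value).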

mysql \
  --user=genome \
  -N \
  --host=genome-mysql.cse.ucsc.edu \
  -A \
  -D hg19 \
  -e "select ensGene.name, name2, chrom, strand, txStart, txEnd, value from ensGene, ensemblToGeneName where ensGene.name = ensemblToGeneName.name" > \
  output.txt

You can try the same command with hg38, but you will have to pick a different gene model, such as RefSeq or UCSC genes.
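For hg38, a rough sketch of the equivalent query against the UCSC genes model could look like the following. It assumes that knownGene and kgXref are joined on knownGene.name = kgXref.kgID, which I have not re-checked against the current hg38 schema. `fetch_hg38_genes` is just a hypothetical helper name, and this gives gene symbols only, not Ensembl IDs.

```shell
# Sketch only: assumes hg38.knownGene and hg38.kgXref are linked through
# knownGene.name = kgXref.kgID (verify against the live schema first).
QUERY='SELECT knownGene.name, chrom, strand, txStart, txEnd, geneSymbol
FROM knownGene
JOIN kgXref ON knownGene.name = kgXref.kgID'

# Hypothetical helper; nothing is contacted until the function is called.
fetch_hg38_genes() {
  mysql --user=genome --host=genome-mysql.cse.ucsc.edu -N -A -D hg38 \
    -e "$QUERY"
}
```

Calling `fetch_hg38_genes > hg38_output.txt` would then write the result to a file, in the same way as the hg19 command above.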


Thank you very much!!! Yep it works!!

10.2 years ago
Mitch Bekritsky ★ 1.3k

Here's an easy way to do it from UCSC's table browser:

  1. In the table browser, select Ensembl Genes as your track
  2. Under output format, choose "selected fields from primary and related tables"
  3. Add your output file name, if you want one (otherwise, it will print to the browser)
  4. Click "get output"
  5. On the next page, you will get to choose your fields.
  6. Under linked tables, check ensemblToGeneName, then press "Allow selection from checked tables"
  7. The page will refresh, and you should have a new table called hg19.ensemblToGeneName
  8. Check off name, chrom, strand, txStart, txEnd, and name2 in hg19.ensGene (or any fields you'd like)
  9. In hg19.ensemblToGeneName, check "value", which has the description "alternate gene name"
  10. Press "get output"

If you did it right, you should get a table that looks a bit like this (I took this chunk from chr1:100,000,000-150,000,000):

#hg19.ensGene.name  hg19.ensGene.chrom  hg19.ensGene.strand  hg19.ensGene.txStart  hg19.ensGene.txEnd  hg19.ensGene.name2  hg19.ensemblToGeneName.value
ENST00000263174     chr1                +                    100111498             100160097           ENSG00000099260     PALMD
ENST00000605497     chr1                +                    100111748             100155633           ENSG00000099260     PALMD
ENST00000605613     chr1                +                    100133135             100135379           ENSG00000099260     PALMD
ENST00000496843     chr1                +                    100148821             100160097           ENSG00000099260     PALMD
ENST00000434734     chr1                +                    100163797             100164734           ENSG00000223656     HMGB3P10
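
One caveat if you really want the transcription start site: in UCSC tables txStart is always the smaller coordinate, so for genes on the minus strand the TSS is txEnd, not txStart. Here is a minimal sketch that appends a strand-aware TSS column to a table with the column order shown above. `add_tss` is a hypothetical helper name, and the coordinates are passed through exactly as they appear in the table, with no conversion.

```shell
# Input columns, as in the table above:
# 1 transcript, 2 chrom, 3 strand, 4 txStart, 5 txEnd, 6 gene ID, 7 symbol
add_tss() {
  awk 'BEGIN { OFS = "\t" }
       /^#/ { next }                      # skip the header line
       { tss = ($3 == "+") ? $4 : $5      # minus strand: TSS is txEnd
         print $0, tss }' "$@"
}
```

Usage would be something like `add_tss output.txt > output_with_tss.txt`. Keep in mind that UCSC start coordinates are 0-based, so add 1 to txStart if you need 1-based positions.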

Ashutosh noticed what I missed -- hg38 does not have Ensembl annotations in UCSC yet. Is there a reason you're choosing hg38 and not hg19?


Actually, I can use hg19 for now, but I may need hg38 soon.

PS: Thanks a lot! Yes, I get it now.


Just one more question, please.

When I select Ensembl Genes as my track, I get 204941 genes with their respective txStart and so on,

and when I select UCSC Genes as my track, I get just 82961, which is about half.

Is that normal?


Those are two different genome annotations:

The UCSC gene track is described here. It is a set of genes taken from RefSeq, GenBank, CCDS, Rfam, and the tRNA genes track.

I couldn't find a similarly clear description for Ensembl, but this is a good start. It seems they rely on deposited mRNAs and protein sequences in public databases. That might mean that their curation is a bit more relaxed than RefSeq, CCDS, etc.

FWIW, whenever I do annotation, I've generally relied on CCDS and RefSeq.


Silly me! Here is the paper describing the Ensembl annotation pipeline. I only skimmed it, but it is an automated gene pipeline that includes gene predictions, which may explain the higher number of transcripts from the Ensembl table.
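
Also note that each row in these tables is a transcript, not a gene, so the row counts overstate the number of genes. A quick sketch to compare the two counts, assuming the Ensembl gene ID (name2) is in column 6 as in my example output (`count_rows_and_genes` is a made-up helper name):

```shell
# Counts data rows (transcripts) and distinct values in column 6 (gene IDs).
count_rows_and_genes() {
  awk '!/^#/ { rows++; if (!($6 in seen)) { seen[$6] = 1; genes++ } }
       END { printf "transcripts=%d genes=%d\n", rows, genes }' "$@"
}
```

Run on the five example rows from my answer, this reports 5 transcripts and 2 genes.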


I understand! Thank you, Mitch!!


It's my pleasure!
