TSS of protein coding genes
15 months ago
arsala521 ▴ 10

Hi everyone,

I want to get the transcription start sites (TSS) of all protein-coding genes in the genome. There are a couple of things I want to ask about.

I found two relevant files for gene coordinates on the UCSC Genome Browser download server: refGene.txt.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz) and geneid.txt.gz (http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/geneid.txt.gz). They appear to have the same fields. Can someone recommend which one should be used?

Also, can someone please suggest a way to extract only the protein-coding genes from these files?

Thanks in advance

TSS protein coding genes
You should get the GTF file from GENCODE. That has information on protein coding transcripts and by extension, genes.
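
As a minimal sketch of that approach, assuming a GENCODE-style GTF in which gene records carry `gene_type` and `gene_name` attributes: the snippet below keeps only `gene` features whose `gene_type` is `protein_coding` and reports a strand-aware TSS (start on the + strand, end on the - strand; GTF coordinates are 1-based). The function names, the file path argument, and the toy records are made up for illustration and are not real annotation.

```python
import gzip
import re

GENE_TYPE = re.compile(r'gene_type "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')

def gene_tss_from_gtf(lines):
    """Yield (chrom, tss, strand, gene_name) for protein-coding gene records."""
    for line in lines:
        if line.startswith("#"):          # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # keep gene-level records only
        chrom, strand, attrs = fields[0], fields[6], fields[8]
        start, end = int(fields[3]), int(fields[4])
        gene_type = GENE_TYPE.search(attrs)
        if gene_type is None or gene_type.group(1) != "protein_coding":
            continue
        name = GENE_NAME.search(attrs)
        # GTF is 1-based inclusive: TSS is start on "+", end on "-"
        tss = start if strand == "+" else end
        yield chrom, tss, strand, name.group(1) if name else "."

def gene_tss_from_gtf_file(path):
    """Same as above, reading a gzip-compressed GTF (path is hypothetical)."""
    with gzip.open(path, "rt") as handle:
        yield from gene_tss_from_gtf(handle)

# Toy records (invented coordinates and names, for illustration only)
TOY_GTF = [
    'chr1\tTOY\tgene\t1000\t2000\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "TOYA";',
    'chr1\tTOY\tgene\t3000\t4000\t.\t-\t.\tgene_id "G2"; gene_type "protein_coding"; gene_name "TOYB";',
    'chr1\tTOY\tgene\t5000\t6000\t.\t+\t.\tgene_id "G3"; gene_type "lncRNA"; gene_name "TOYC";',
]

for record in gene_tss_from_gtf(TOY_GTF):
    print(*record, sep="\t")
```

This gives one TSS per gene. If you need one TSS per transcript instead, the same idea should work on `transcript` features, filtering on the transcript-level attribute.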

Thank you. It really helped.

6 weeks ago
gperez8 ▴ 10


Geneid Genes (geneid.txt.gz) comes from an older gene prediction program that works from the genome sequence alone. It is only relevant when you are working on a particular locus where you think that the manually curated gene models (Ensembl and RefSeq) have errors.

UCSC RefSeq (refGene.txt.gz) contains NCBI RNA reference sequences aligned to the human genome with BLAT, the Blast-Like Alignment Tool of the UCSC Genome Browser. The track shows known human protein-coding and non-protein-coding genes.

See our FAQ page for more information: http://genome.ucsc.edu/FAQ/FAQgenes.html#genename

You can use the Table Browser to extract the transcription start sites (TSS) of protein-coding genes. For example, to query UCSC RefSeq (refGene) on hg38, navigate to the Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables) and make the following selections:

  1. Under Select dataset:

    clade: Mammal

    genome: Human

    assembly: Dec. 2013 (GRCh38/hg38)

    group: Genes and Gene Predictions

    track: NCBI RefSeq

    table: UCSC RefSeq (refGene)

  2. Set the region: to “genome”

  3. Click create next to “filter:”

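  4. On the “Filter on Fields from hg38.refGene” page, find the cdsEnd row, change “ignored” to “!=”, enter “cdsStart” in the text box next to it, then click submit. Non-coding transcripts have cdsStart equal to cdsEnd, so this filter keeps only the coding ones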

  5. Set the output format to “Selected fields from primary and related tables”. This will allow you to select fields of interest. Click get output

  6. On the following page, scroll down to the Linked Tables section and select "hgFixed refLink" then click allow selection from checked tables

  7. You can then select the following fields:

    name: Name of gene

    chrom: Reference sequence chromosome or scaffold

    strand: + or - for strand

    txStart: Transcription start position

    protAcc: protein accession

  8. Click get output

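This should display all the protein-coding transcripts with their txStart coordinates and protein accession numbers. Note that txStart is always the leftmost coordinate on the chromosome, so it is the TSS only for + strand transcripts. For - strand transcripts the TSS is at txEnd, so also select txEnd in step 7 if you need strand-aware start sites.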
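
If you would rather work from the downloaded refGene.txt.gz directly, the same filter can be sketched in Python. This is only a rough equivalent of the Table Browser steps, assuming the standard refGene column order (a leading bin column followed by the genePred extended fields) and 0-based, half-open coordinates. It skips rows where cdsStart equals cdsEnd and picks the TSS by strand. The function names, the path argument, and the toy rows are made up for illustration.

```python
import gzip

# Column order of refGene.txt.gz (assumed: bin + genePred extended fields)
COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]

def coding_tss(lines):
    """Yield (accession, gene, chrom, strand, tss) for coding transcripts."""
    for line in lines:
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        # Same filter as the Table Browser step: cdsEnd != cdsStart
        if int(row["cdsStart"]) == int(row["cdsEnd"]):
            continue
        tx_start, tx_end = int(row["txStart"]), int(row["txEnd"])
        # 0-based half-open: first base is txStart on "+", txEnd - 1 on "-"
        tss = tx_start if row["strand"] == "+" else tx_end - 1
        yield row["name"], row["name2"], row["chrom"], row["strand"], tss

def coding_tss_from_file(path):
    """Same as above, reading the gzip-compressed table (path is hypothetical)."""
    with gzip.open(path, "rt") as handle:
        yield from coding_tss(handle)

# Toy rows (invented accessions and coordinates, for illustration only)
TOY_ROWS = [
    "0\tNM_TOY1\tchr1\t+\t100\t500\t150\t450\t1\t100,\t500,\t0\tTOYA\tcmpl\tcmpl\t0,",
    "0\tNM_TOY2\tchr1\t-\t1000\t2000\t1100\t1900\t1\t1000,\t2000,\t0\tTOYB\tcmpl\tcmpl\t0,",
    "0\tNR_TOY3\tchr2\t+\t300\t700\t700\t700\t1\t300,\t700,\t0\tTOYC\tnone\tnone\t-1,",
]

for record in coding_tss(TOY_ROWS):
    print(*record, sep="\t")
```

The reported TSS is a 0-based position. Add 1 if you need the 1-based coordinate shown in the Genome Browser. A gene with several transcripts will appear once per transcript, so collapse on the gene symbol (name2) if you want a single TSS per gene.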

If you have any follow-up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.


