Question

RefGene: how to find the starts and ends of genes?

6.3 years ago

endrebak ▴ 980

I have the following data:

                 Chromosome     Start       End                                       XS                                       XE TranscriptID        GeneID Strand
36945                 chr19  54754594  54756329           b'54754594,54756005,54756286,'           b'54755063,54756202,54756329,'    NR_110738  LOC101928804      -
36948                 chr19  54769421  54771064           b'54769421,54770669,54771021,'           b'54769891,54770937,54771064,'    NR_110737  LOC101928804      -
36949                 chr19  54769421  54771064           b'54769421,54770740,54771021,'           b'54769891,54770937,54771064,'    NR_110738  LOC101928804      -
36951                 chr19  54785868  54835292  b'54785868,54816103,54834899,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110737  LOC101928804      -
36952                 chr19  54785868  54835292  b'54785868,54816103,54834970,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110738  LOC101928804      -

Here you see transcripts from the same gene. Start is txStart and End is txEnd. For this gene (LOC101928804), can I say that the gene extends from 54754594 to 54835292 (from the start of the first transcript to the end of the last transcript)? Or is this an oversimplification in some way?

refgene ucsc • 5.0k views

ADD COMMENT • link updated 6.3 years ago by Luis Nassar ▴ 670 • written 6.3 years ago by endrebak ▴ 980

I would not do that. If all your transcripts skip an exon at the 3' or 5' end, you will miss some information. What you can do instead is use biomaRt with your GeneID to get the gene position from the Ensembl annotation.

ADD REPLY • link 6.3 years ago by Bastien Hervé 6.4k

Thanks. Do you know if refgene has some gene start and end info? But perhaps that is a question for another thread.

ADD REPLY • link 6.3 years ago by endrebak ▴ 980

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz

grep -m 1 LOC101928804 GRCh38_latest_genomic.gff
#NC_000019.10   BestRefSeq  gene    54769422    54771064    .   -   .   ID=gene-LOC101928804;Dbxref=GeneID:101928804;Name=LOC101928804;description=uncharacterized LOC101928804;gbkey=Gene;gene=LOC101928804;gene_biotype=lncRNA

Gene name information is under gene=
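A minimal Python sketch of reading that gene record, assuming the standard nine tab-separated GFF3 columns. The function name `parse_gff3_gene` is hypothetical, and the line is copied from the grep output above without the leading `#`. Note that GFF3 coordinates are 1-based and inclusive, while UCSC tables such as refGene use 0-based, half-open coordinates, which is why the GFF start 54769422 corresponds to txStart 54769421.

```python
# One GFF3 record, copied from the grep output above (tab-separated columns).
line = (
    "NC_000019.10\tBestRefSeq\tgene\t54769422\t54771064\t.\t-\t.\t"
    "ID=gene-LOC101928804;Dbxref=GeneID:101928804;Name=LOC101928804;"
    "description=uncharacterized LOC101928804;gbkey=Gene;"
    "gene=LOC101928804;gene_biotype=lncRNA"
)


def parse_gff3_gene(record):
    """Split one GFF3 line into coordinates and an attribute dictionary."""
    cols = record.rstrip("\n").split("\t")
    seqid, source, feature, start, end, score, strand, phase, attrs = cols
    # Column 9 is a ';'-separated list of key=value pairs.
    attributes = dict(item.split("=", 1) for item in attrs.split(";") if item)
    return {
        "seqid": seqid,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "attributes": attributes,
    }


gene = parse_gff3_gene(line)

# Convert to the 0-based, half-open convention used by UCSC tables:
# only the start shifts by one, the end stays the same.
ucsc_start = gene["start"] - 1
ucsc_end = gene["end"]

print(gene["attributes"]["gene"], ucsc_start, ucsc_end)
# LOC101928804 54769421 54771064
```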

ADD REPLY • link 6.3 years ago by Bastien Hervé 6.4k

If this is the exon skipping you speak of I do not see why it should matter: https://en.wikipedia.org/wiki/Exon_skipping

UCSC refgene probably does not include such mistakes, no?

ADD REPLY • link 6.3 years ago by endrebak ▴ 980

How did you get this from refGene?

"UCSC refgene probably does not include such mistakes"

These are not mistakes, just different types of transcripts.

ADD REPLY • link 6.3 years ago by Bastien Hervé 6.4k

score 4 · Accepted Answer · 2019-04-03

6.3 years ago

Luis Nassar ▴ 670

In almost all cases, what you are describing should work to find the start and end of the gene. We build the refGene track by aligning all the NCBI sequences ourselves with BLAT, though we also offer a track with all the NCBI-provided alignments: ncbiRefSeq. All of the NR_* accession numbers are experimentally validated sequences. The txStart equals the first position in exonStarts, and the txEnd equals the final position in exonEnds.

You can query these from our public MySQL server:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ee "select * from refGene where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

*************************** 1. row ***************************
         bin: 1002
        name: NR_110738
       chrom: chr19
      strand: -
     txStart: 54754594
       txEnd: 54756329
    cdsStart: 54756329
      cdsEnd: 54756329
   exonCount: 3
  exonStarts: 54754594,54756005,54756286,
    exonEnds: 54755063,54756202,54756329,
       score: 0
       name2: LOC101928804
cdsStartStat: unk
  cdsEndStat: unk
  exonFrames: -1,-1,-1,
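As a quick check of the txStart/txEnd relationship described above, here is a small Python sketch that parses the comma-terminated exonStarts and exonEnds fields from the row shown. The helper `parse_coords` and the `row` dictionary are illustrative names, not part of any UCSC tool; the values are copied from the query output.

```python
def parse_coords(field):
    """Turn a refGene-style '1,2,3,' string into a list of ints."""
    return [int(x) for x in field.split(",") if x]


# Values copied from the refGene row shown above (NR_110738, chr19).
row = {
    "txStart": 54754594,
    "txEnd": 54756329,
    "exonStarts": "54754594,54756005,54756286,",
    "exonEnds": "54755063,54756202,54756329,",
}

starts = parse_coords(row["exonStarts"])
ends = parse_coords(row["exonEnds"])

# txStart should be the smallest exon start, txEnd the largest exon end.
tx_matches = row["txStart"] == min(starts) and row["txEnd"] == max(ends)

# Coordinates are 0-based, half-open, so exon length is simply end - start.
exon_lengths = [end - start for start, end in zip(starts, ends)]

print(tx_matches, exon_lengths)
# True [469, 197, 43]
```

If the fields arrive as bytes (as in the `b'...'` values in the question), decoding them first with `.decode()` should be enough before calling `parse_coords`.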

You can also pull out just the txStart and txEnd, which may save you some time:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -e "select name,txStart,txEnd from refGene where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

+-----------+----------+----------+
| name      | txStart  | txEnd    |
+-----------+----------+----------+
| NR_110738 | 54754594 | 54756329 |
| NR_110737 | 54769421 | 54771064 |
| NR_110738 | 54769421 | 54771064 |
| NR_110737 | 54785868 | 54835292 |
| NR_110738 | 54785868 | 54835292 |
+-----------+----------+----------+
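To make the question's "first start to last end" idea concrete, the sketch below computes that overall span from the five rows above and, as an alternative, groups overlapping transcripts into separate loci. It assumes all rows lie on the same chromosome and strand (chr19, minus strand, per the table in the question); the function names `gene_span` and `merge_overlapping` are hypothetical.

```python
# (name, txStart, txEnd) copied from the query output above.
rows = [
    ("NR_110738", 54754594, 54756329),
    ("NR_110737", 54769421, 54771064),
    ("NR_110738", 54769421, 54771064),
    ("NR_110737", 54785868, 54835292),
    ("NR_110738", 54785868, 54835292),
]


def gene_span(transcripts):
    """Smallest txStart and largest txEnd over all transcripts."""
    return (
        min(start for _, start, _ in transcripts),
        max(end for _, _, end in transcripts),
    )


def merge_overlapping(transcripts):
    """Merge transcripts whose 0-based, half-open intervals overlap."""
    loci = []
    for _, start, end in sorted(transcripts, key=lambda t: (t[1], t[2])):
        if loci and start < loci[-1][1]:
            # Overlaps the current locus: extend it if needed.
            loci[-1][1] = max(loci[-1][1], end)
        else:
            # No overlap: start a new locus.
            loci.append([start, end])
    return [tuple(locus) for locus in loci]


span = gene_span(rows)
loci = merge_overlapping(rows)

print(span)
# (54754594, 54835292)
print(loci)
# [(54754594, 54756329), (54769421, 54771064), (54785868, 54835292)]
```

Here the single span covers three non-overlapping alignment locations. The GFF gene record quoted in an earlier reply corresponds only to the middle one, so whether to report one span or one interval per locus depends on what "the gene" should mean for your analysis.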

You can also pull out the selected info via point and click using http://genome.ucsc.edu/cgi-bin/hgTables.

Lou UCSC GB

ADD COMMENT • link 6.3 years ago by Luis Nassar ▴ 670

Wonderful. Is ncbiRefSeq also a database somewhere?

ADD REPLY • link 6.3 years ago by endrebak ▴ 980

You can find all the hg38 annotations, including ncbiRefSeq, in our downloads directory:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

You can also do a MySQL query same as above, but instead of querying the refGene table, you can query ncbiRefSeq:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -e "select name,txStart,txEnd from ncbiRefSeq where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

+-------------+----------+----------+
| name        | txStart  | txEnd    |
+-------------+----------+----------+
| NR_110737.1 | 54769421 | 54771064 |
| NR_110738.1 | 54769421 | 54771064 |
+-------------+----------+----------+

You can look at the track description page for all the different NCBI tracks and a description of what they contain: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite

The only track we 'generate' ourselves is UCSC RefSeq (the table is named refGene), which is described as follows:

UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track.

It is worth mentioning, however, that nearly 100% of our alignments match the locations NCBI reports. The few differences arise because we use BLAT with certain parameters, while they (I believe) use BLAST.

ADD REPLY • link 6.3 years ago by Luis Nassar ▴ 670

Dear Nassar,

I happened to notice that in some cases the cdsStart equals the cdsEnd. Why does that happen?

ADD REPLY • link 5.2 years ago by u3005579 • 0