Question: RefGene: how to find the starts and ends of genes?
endrebak810 wrote:

I have the following data:

                 Chromosome     Start       End                                       XS                                       XE TranscriptID        GeneID Strand
36945                 chr19  54754594  54756329           b'54754594,54756005,54756286,'           b'54755063,54756202,54756329,'    NR_110738  LOC101928804      -
36948                 chr19  54769421  54771064           b'54769421,54770669,54771021,'           b'54769891,54770937,54771064,'    NR_110737  LOC101928804      -
36949                 chr19  54769421  54771064           b'54769421,54770740,54771021,'           b'54769891,54770937,54771064,'    NR_110738  LOC101928804      -
36951                 chr19  54785868  54835292  b'54785868,54816103,54834899,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110737  LOC101928804      -
36952                 chr19  54785868  54835292  b'54785868,54816103,54834970,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110738  LOC101928804      -

Here you see transcripts from the same gene. Start is txStart and End is txEnd. For this gene (LOC101928804), can I say that the gene extends from 54754594 to 54835292 (from the start of the first transcript to the end of the last transcript)? Or is this an oversimplification in some way?

ucsc refgene • 521 views
ADD COMMENTlink modified 12 months ago by Luis Nassar360 • written 12 months ago by endrebak810

I would not do that. If all your transcripts have an exon skipped at the 3' or 5' end, you'll miss some information. What you can do is use biomaRt with your GeneID to get the gene position from the Ensembl annotation.

ADD REPLYlink modified 12 months ago • written 12 months ago by Bastien Hervé4.5k

Thanks. Do you know if refgene has some gene start and end info? But perhaps that is a question for another thread.

ADD REPLYlink written 12 months ago by endrebak810

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz

grep -m 1 LOC101928804 GRCh38_latest_genomic.gff
#NC_000019.10   BestRefSeq  gene    54769422    54771064    .   -   .   ID=gene-LOC101928804;Dbxref=GeneID:101928804;Name=LOC101928804;description=uncharacterized LOC101928804;gbkey=Gene;gene=LOC101928804;gene_biotype=lncRNA

Gene name information is under gene=
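
As a rough sketch of how that GFF line could be read programmatically, the Python below splits the nine GFF3 columns and the key=value attributes of the gene record shown above. The helper name parse_gff_gene_line is made up for illustration, and the line is pasted in by hand with its tabs restored. Note that GFF coordinates are 1-based and inclusive, while UCSC tables such as refGene store 0-based starts, which is why 54769422 here corresponds to a txStart of 54769421.

```python
def parse_gff_gene_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    # Column 9 is a semicolon-separated list of key=value pairs.
    attributes = dict(
        field.split("=", 1) for field in attrs.split(";") if "=" in field
    )
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "strand": strand,
        "gene": attributes.get("gene"),
        # UCSC-style 0-based start, for comparison with refGene txStart
        "ucsc_start": int(start) - 1,
    }


# The gene record from the grep output above, with tabs restored.
gff_line = "\t".join([
    "NC_000019.10", "BestRefSeq", "gene", "54769422", "54771064", ".", "-", ".",
    "ID=gene-LOC101928804;Dbxref=GeneID:101928804;Name=LOC101928804;"
    "description=uncharacterized LOC101928804;gbkey=Gene;"
    "gene=LOC101928804;gene_biotype=lncRNA",
])

record = parse_gff_gene_line(gff_line)
print(record["gene"], record["ucsc_start"], record["end"])
```

Since grep -m 1 stops at the first match, a script like this would need to loop over all lines of the file to see whether the gene name occurs in more than one gene record.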

ADD REPLYlink written 12 months ago by Bastien Hervé4.5k

If this is the exon skipping you speak of I do not see why it should matter: https://en.wikipedia.org/wiki/Exon_skipping

UCSC refgene probably does not include such mistakes, no?

ADD REPLYlink written 12 months ago by endrebak810

How did you get this from RefGene?

UCSC refgene probably does not include such mistakes

These are not mistakes, just different types of transcripts.

ADD REPLYlink written 12 months ago by Bastien Hervé4.5k
Luis Nassar360 wrote:

In almost all cases, what you are describing should work to find the start/end of the gene. The way we build the refGene track is to BLAT all the NCBI sequences ourselves, though we also offer a track with all the NCBI-provided alignments: ncbiRefSeq. All of the NR_* accession numbers are experimentally validated sequences. The txStart is equal to the first position of exonStarts, and txEnd equals the final exonEnds.

You can query these from our public MySQL server:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -Ee "select * from refGene where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

*************************** 1. row ***************************
         bin: 1002
        name: NR_110738
       chrom: chr19
      strand: -
     txStart: 54754594
       txEnd: 54756329
    cdsStart: 54756329
      cdsEnd: 54756329
   exonCount: 3
  exonStarts: 54754594,54756005,54756286,
    exonEnds: 54755063,54756202,54756329,
       score: 0
       name2: LOC101928804
cdsStartStat: unk
  cdsEndStat: unk
  exonFrames: -1,-1,-1,
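
To illustrate the relationship between txStart/txEnd and the exon lists, here is a small Python sketch using the row printed above. The function names parse_coords and exon_bounds are hypothetical; the only assumption is that exonStarts and exonEnds are comma-separated integers with a trailing comma, as in the output.

```python
def parse_coords(field):
    """Turn a refGene-style '1,2,3,' string into a list of ints."""
    return [int(x) for x in field.split(",") if x]


def exon_bounds(exon_starts, exon_ends):
    """Return (first exon start, last exon end) for one transcript row."""
    starts = parse_coords(exon_starts)
    ends = parse_coords(exon_ends)
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return min(starts), max(ends)


# Values copied from the row printed above.
row = {
    "txStart": 54754594,
    "txEnd": 54756329,
    "exonCount": 3,
    "exonStarts": "54754594,54756005,54756286,",
    "exonEnds": "54755063,54756202,54756329,",
}

first_start, last_end = exon_bounds(row["exonStarts"], row["exonEnds"])
print(first_start == row["txStart"], last_end == row["txEnd"])  # True True
```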

You can also pull out just the txStart and txEnd, which may save you some time:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -e "select name,txStart,txEnd from refGene where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

+-----------+----------+----------+
| name      | txStart  | txEnd    |
+-----------+----------+----------+
| NR_110738 | 54754594 | 54756329 |
| NR_110737 | 54769421 | 54771064 |
| NR_110738 | 54769421 | 54771064 |
| NR_110737 | 54785868 | 54835292 |
| NR_110738 | 54785868 | 54835292 |
+-----------+----------+----------+
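
One thing the table makes visible: the same accessions (NR_110737, NR_110738) appear at several separate positions, so taking the minimum txStart and maximum txEnd per gene name would span all of them. The following Python sketch, with a hypothetical helper merge_into_loci, groups overlapping transcripts into loci first. Whether the separate loci should be treated as one gene is a judgment call that the code does not make for you.

```python
def merge_into_loci(intervals):
    """Merge overlapping (start, end) intervals; returns a sorted list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the current locus: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(locus) for locus in merged]


# (name, txStart, txEnd) rows copied from the query result above.
transcripts = [
    ("NR_110738", 54754594, 54756329),
    ("NR_110737", 54769421, 54771064),
    ("NR_110738", 54769421, 54771064),
    ("NR_110737", 54785868, 54835292),
    ("NR_110738", 54785868, 54835292),
]

loci = merge_into_loci((start, end) for _, start, end in transcripts)
naive_span = (
    min(start for _, start, _ in transcripts),
    max(end for _, _, end in transcripts),
)

print(loci)        # three disjoint loci
print(naive_span)  # one span covering all of them
```

For these rows the merge yields three non-overlapping loci, whereas the naive span runs from 54754594 to 54835292.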

You can also pull out the selected info via point and click using http://genome.ucsc.edu/cgi-bin/hgTables.

Lou UCSC GB

ADD COMMENTlink written 12 months ago by Luis Nassar360

Wonderful. Is ncbiRefSeq also a database somewhere?

ADD REPLYlink written 12 months ago by endrebak810

You can find all the hg38 annotations, including ncbiRefSeq, in our downloads directory:

http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/

You can also run the same MySQL query as above, but against the ncbiRefSeq table instead of refGene:

$ mysql -h genome-mysql.soe.ucsc.edu -ugenome -e "select name,txStart,txEnd from ncbiRefSeq where name2 like 'LOC101928804' and chrom not like '%alt'" hg38

+-------------+----------+----------+
| name        | txStart  | txEnd    |
+-------------+----------+----------+
| NR_110737.1 | 54769421 | 54771064 |
| NR_110738.1 | 54769421 | 54771064 |
+-------------+----------+----------+
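
The refGene and ncbiRefSeq results can be compared directly once the version suffix (.1) is stripped from the ncbiRefSeq accessions. This is only a sketch with the rows from both queries pasted in by hand; strip_version and the variable names are illustrative, not part of any UCSC tool.

```python
def strip_version(accession):
    """'NR_110737.1' -> 'NR_110737'; unversioned names pass through."""
    return accession.split(".", 1)[0]


# (name, txStart, txEnd) from the refGene query.
refgene_rows = [
    ("NR_110738", 54754594, 54756329),
    ("NR_110737", 54769421, 54771064),
    ("NR_110738", 54769421, 54771064),
    ("NR_110737", 54785868, 54835292),
    ("NR_110738", 54785868, 54835292),
]

# (name, txStart, txEnd) from the ncbiRefSeq query.
ncbi_rows = [
    ("NR_110737.1", 54769421, 54771064),
    ("NR_110738.1", 54769421, 54771064),
]

refgene_set = set(refgene_rows)
ncbi_set = {(strip_version(name), start, end) for name, start, end in ncbi_rows}

shared = sorted(refgene_set & ncbi_set)        # placements present in both tables
refgene_only = sorted(refgene_set - ncbi_set)  # extra placements in refGene only

print(shared)
print(refgene_only)
```

For this gene, both ncbiRefSeq rows also appear in refGene, and refGene lists three additional placements.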

You can look at the track description page for all the different NCBI tracks and a description of what they contain: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=refSeqComposite

The only track we 'generate' ourselves is the UCSC RefSeq track (the table is named refGene), which is described as follows:

UCSC RefSeq – annotations generated from UCSC's realignment of RNAs with NM and NR accessions to the human genome. This track was previously known as the "RefSeq Genes" track.

It is worth mentioning, however, that nearly 100% of the alignments match the locations NCBI reports. The few differences arise because we use BLAT with certain parameters, while they BLAST their sequences (I believe).

ADD REPLYlink modified 12 months ago • written 12 months ago by Luis Nassar360