I have the following data:
                 Chromosome     Start       End                                       XS                                       XE TranscriptID        GeneID Strand
36945                 chr19  54754594  54756329           b'54754594,54756005,54756286,'           b'54755063,54756202,54756329,'    NR_110738  LOC101928804      -
36948                 chr19  54769421  54771064           b'54769421,54770669,54771021,'           b'54769891,54770937,54771064,'    NR_110737  LOC101928804      -
36949                 chr19  54769421  54771064           b'54769421,54770740,54771021,'           b'54769891,54770937,54771064,'    NR_110738  LOC101928804      -
36951                 chr19  54785868  54835292  b'54785868,54816103,54834899,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110737  LOC101928804      -
36952                 chr19  54785868  54835292  b'54785868,54816103,54834970,54835251,'  b'54785899,54816541,54835167,54835292,'    NR_110738  LOC101928804      -
Here you see transcripts from the same gene. Start is txStart and End is txEnd. For this gene (LOC101928804), can I say that the gene spans 54754594 to 54835292 (from the start of the first transcript to the end of the last transcript)? Or is this an oversimplification in some way?
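For reference, the span described in the question (smallest txStart to largest txEnd per gene) could be computed roughly like this. This is only a sketch of that naive approach, not a recommendation; the function name `naive_gene_spans` is made up for illustration, and the rows are copied from the table above with the exon columns left out.

```python
# (Chromosome, Start, End, TranscriptID, GeneID, Strand), copied from the question.
TRANSCRIPTS = [
    ("chr19", 54754594, 54756329, "NR_110738", "LOC101928804", "-"),
    ("chr19", 54769421, 54771064, "NR_110737", "LOC101928804", "-"),
    ("chr19", 54769421, 54771064, "NR_110738", "LOC101928804", "-"),
    ("chr19", 54785868, 54835292, "NR_110737", "LOC101928804", "-"),
    ("chr19", 54785868, 54835292, "NR_110738", "LOC101928804", "-"),
]


def naive_gene_spans(rows):
    """Map (GeneID, Chromosome, Strand) to (min Start, max End).

    Hypothetical helper: it assumes every row with the same key belongs
    to a single locus, which is exactly the simplification being asked about.
    """
    spans = {}
    for chrom, start, end, _transcript, gene, strand in rows:
        key = (gene, chrom, strand)
        if key in spans:
            old_start, old_end = spans[key]
            spans[key] = (min(old_start, start), max(old_end, end))
        else:
            spans[key] = (start, end)
    return spans


spans = naive_gene_spans(TRANSCRIPTS)
print(spans)
```

Note that the rows above fall into three coordinate ranges that do not overlap each other, and this sketch simply merges them into one interval.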
I would not do that. If all your transcripts have an exon skipped at the 3' or 5' end, you'll miss some information. What you can do is use biomaRt with your GeneID to get the gene position from the Ensembl annotation.
Thanks. Do you know if refGene has gene start and end info? But perhaps that is a question for another thread.
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
Gene name information is under the gene= attribute.
If this is the exon skipping you speak of, I do not see why it should matter: https://en.wikipedia.org/wiki/Exon_skipping
UCSC refgene probably does not include such mistakes, no?
How did you get this from refGene?
These are not mistakes, just different types of transcripts.
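Coming back to the RefSeq GFF file linked earlier in the thread: if you want the annotated gene coordinates rather than a span inferred from transcripts, something along these lines might work. It is a minimal sketch that assumes gene records have "gene" in the feature-type column and carry the gene= attribute mentioned above. The function name `gene_coordinates` and the sample lines (GENE_A, GENE_B, seq1) are invented for illustration and are not real records from that file.

```python
# Invented GFF3 lines for illustration only; not taken from the RefSeq file.
SAMPLE_GFF = [
    "##gff-version 3\n",
    "seq1\tExample\tgene\t100\t900\t.\t-\t.\tID=gene-GENE_A;gene=GENE_A\n",
    "seq1\tExample\tlnc_RNA\t100\t900\t.\t-\t.\tID=rna-1;Parent=gene-GENE_A;gene=GENE_A\n",
    "seq1\tExample\tgene\t2000\t2500\t.\t+\t.\tID=gene-GENE_B;gene=GENE_B\n",
]


def gene_coordinates(gff_lines, gene_name):
    """Return (seqid, start, end, strand) for each 'gene' feature named gene_name.

    Hypothetical helper: assumes nine tab-separated GFF3 columns and a
    'gene=' key in the attributes column.
    """
    hits = []
    for line in gff_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        # Attributes look like "key1=value1;key2=value2".
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if attributes.get("gene") == gene_name:
            hits.append((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return hits


result = gene_coordinates(SAMPLE_GFF, "GENE_A")
print(result)
```

For the real file you could pass the handle from `gzip.open(path, "rt")` as `gff_lines`. Keep in mind that GFF3 coordinates are 1-based, while txStart in UCSC tables is 0-based, and that the sequence IDs in the RefSeq file are probably accessions rather than names like chr19, so the output may not match the table in the question directly.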