Question

Inconsistency in Ensembl transcript IDs between UCSC and GENCODE

5.8 years ago

sckinta ▴ 730

I want to classify the peaks/regions identified in my sequencing experiment by annotated region type (intron, 3' UTR, CDS, exon, and intergenic) in genome build hg19, so I downloaded annotation files from both UCSC and GENCODE.

Using the UCSC Table Browser, I selected assembly GRCh37/hg19, track GENCODE Genes V19, and table Basic (wgEncodeGencodeBasicV19), then exported BED files for introns (plus 0), 3' UTRs, CDS, and 5' UTRs. An example from the 3' UTR BED file:

chr1    67208778    67210057    ENST00000237247.6_utr3_26_0_chr1_67208779_f 0   +
chr1    67208778    67210768    ENST00000371039.1_utr3_21_0_chr1_67208779_f 0   +
chr1    67208778    67208882    ENST00000371035.3_utr3_21_0_chr1_67208779_f 0   +

However, when I compared those transcript IDs against the GENCODE V19 GTF (evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)), downloaded directly from GENCODE (https://www.gencodegenes.org/releases/reference_releases.html), I found that ~3000 transcript IDs from UCSC are missing from the GENCODE GTF.

Here is the code I used to find the differences.

sed "/^#/d" gencode.v19.annotation.gtf | awk 'BEGIN{FS="\t"; OFS="\t"}{if ($3=="transcript") print $0}' | cut -f 9 | awk 'BEGIN{FS="; "; OFS="\t"} {sub(/transcript_id /,"",$2); print $2}' | sed "s/\"//g" | sort -u > gencode_transcript_id.tsv

comm -13 gencode_transcript_id.tsv <(cut -f 4 genecode_v19_3UTR | cut -d "_" -f 1 | sort -u) | wc -l 
# 3164
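
The pipeline above assumes that transcript_id is always the second attribute in column 9. A minimal sketch of the same comparison that matches the attribute by name instead is shown here; the function names are hypothetical, and it assumes a tab-separated GTF and a BED file whose name column starts with the transcript ID followed by an underscore, as in the export above.

```shell
# Extract unique transcript IDs from the "transcript" rows of a GTF,
# matching the transcript_id attribute by name rather than by position.
gtf_transcript_ids() {
    grep -v '^#' "$1" \
        | awk -F'\t' '$3 == "transcript" { print $9 }' \
        | grep -o 'transcript_id "[^"]*"' \
        | cut -d'"' -f2 \
        | sort -u
}

# Extract unique transcript IDs from the name column (column 4) of a
# UCSC Table Browser BED export, e.g. ENST00000237247.6_utr3_26_0_...
bed_transcript_ids() {
    cut -f4 "$1" | cut -d'_' -f1 | sort -u
}

# Print IDs that occur in the BED file ($2) but not in the GTF ($1).
ids_missing_from_gtf() {
    gtf_ids=$(mktemp)
    bed_ids=$(mktemp)
    gtf_transcript_ids "$1" > "$gtf_ids"
    bed_transcript_ids "$2" > "$bed_ids"
    comm -13 "$gtf_ids" "$bed_ids"
    rm -f "$gtf_ids" "$bed_ids"
}
```

With the files from this post, it would be called as `ids_missing_from_gtf gencode.v19.annotation.gtf genecode_v19_3UTR | wc -l`.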

Examples of IDs that were not found:

ENST00000211372.5
ENST00000211377.3
ENST00000211402.6
ENST00000229725.4
ENST00000230221.4
ENST00000230236.4
ENST00000241802.5
ENST00000259726.6
ENST00000259875.7
ENST00000259891.7
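
One way to narrow this down is to check whether the missing IDs appear in the GTF under the same accession but a different version suffix. The sketch below is only a diagnostic under that assumption, not a confirmed explanation; the function name and the two input files (one ID per line) are hypothetical.

```shell
# $1: file of GTF transcript IDs, one per line (e.g. gencode_transcript_id.tsv)
# $2: file of IDs missing from the GTF, one per line
# For each missing ID whose base accession (the part before ".") also
# occurs in the GTF list, print: base accession, GTF ID, missing ID.
same_accession_other_version() {
    awk -F'.' -v OFS='\t' '
        NR == FNR { gtf[$1] = $0; next }     # first file: index GTF IDs by base accession
        $1 in gtf { print $1, gtf[$1], $0 }  # second file: look up each missing ID
    ' "$1" "$2"
}
```

If this prints a row for most of the missing IDs, the two sources would differ in transcript version rather than in transcript content; if it prints nothing, the accessions themselves are absent from the GTF.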

According to the UCSC table schema description, both should be based on the same Ensembl version (v74).

Has anyone else run into this? Where do those transcript IDs come from?

genome assembly • 1.5k views

Do not waste your time comparing databases. Use one source of data or the other. I say that because things like this can take quite some time, and in the end you will have to use one of them anyway. I personally use GENCODE, because the files are well formatted and parsable (which, to put it carefully, does not always hold true for RefSeq).

5.8 years ago by ATpoint 82k