Question: Inconsistent Ensembl transcript IDs between UCSC and GENCODE
sckinta wrote (11 days ago):

I want to classify the peaks/regions I identified from my sequencing experiment into annotated region types (intron, 3' UTR, CDS, exon and intergenic) in genome build hg19. So I downloaded annotation files from both UCSC and GENCODE.

Using the UCSC Table Browser, I selected assembly GRCh37/hg19, track GENCODE Genes V19, table Basic (wgEncodeGencodeBasicV19), and exported BED files for introns (plus 0), 3' UTR, CDS and 5' UTR. An example of the 3' UTR BED file:

chr1    67208778    67210057    ENST00000237247.6_utr3_26_0_chr1_67208779_f 0   +
chr1    67208778    67210768    ENST00000371039.1_utr3_21_0_chr1_67208779_f 0   +
chr1    67208778    67208882    ENST00000371035.3_utr3_21_0_chr1_67208779_f 0   +
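For reference, the transcript ID is the part of the BED name field (column 4) before the first underscore. Here is a minimal sketch that pulls out that ID together with the token after it. The function name `bed_name_parts` is made up for illustration, and reading the second token as the feature type is an assumption based on the rows above, not something taken from the UCSC documentation.

```shell
# Print "transcript_id<TAB>second_token" for each tab-delimited BED line on stdin.
# Assumes column 4 looks like ENST00000237247.6_utr3_26_0_chr1_67208779_f.
bed_name_parts() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
         {
             # Split the name field on "_"; only the first two pieces are used.
             split($4, parts, "_")
             print parts[1], parts[2]
         }'
}
```

It could be run as `bed_name_parts < genecode_v19_3UTR | sort -u` to list the distinct transcript IDs in the export.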

However, when I compared those transcript IDs to the GENCODE V19 GTF (evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)) downloaded directly from the GENCODE website (https://www.gencodegenes.org/releases/reference_releases.html), I found that ~3000 transcript IDs from UCSC cannot be found in the GENCODE GTF.

Here is the code I used to find the differences.

# Drop header lines (those starting with "#"), keep "transcript" features, and extract the transcript_id attribute
sed "/^#/d" gencode.v19.annotation.gtf | awk 'BEGIN{FS="\t"} $3=="transcript" {print $9}' | awk 'BEGIN{FS="; "} {sub(/transcript_id /,"",$2); print $2}' | sed "s/\"//g" | sort -u > gencode_transcript_id.tsv

comm -13 gencode_transcript_id.tsv <(cut -f 4 genecode_v19_3UTR | cut -d "_" -f 1 | sort -u) | wc -l 
# 3164

Examples of the IDs that were not found:

ENST00000211372.5
ENST00000211377.3
ENST00000211402.6
ENST00000229725.4
ENST00000230221.4
ENST00000230236.4
ENST00000241802.5
ENST00000259726.6
ENST00000259875.7
ENST00000259891.7

According to the UCSC table schema description, both should use the same Ensembl version (v74).

Has anyone experienced this? Where do those transcript IDs come from?
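One way to narrow this down is to check whether each missing ID is absent from the GTF altogether or is present under a different version suffix (the number after the "."). Nothing in this thread establishes that versions are the cause, so this is only a diagnostic sketch. The function name `version_check` and the file name `missing_ids.txt` are made up; the latter stands for the output of the `comm` command above, saved to a file instead of piped to `wc -l`.

```shell
# Usage: version_check missing_ids.txt gencode_transcript_id.tsv
# For each ID in the first file, report whether the same accession
# (the text before the first ".") occurs in the second file.
version_check() {
    awk -v ref="$2" '
        BEGIN {
            OFS = "\t"
            # Index the reference IDs by accession, ignoring the version.
            while ((getline line < ref) > 0) {
                split(line, a, ".")
                seen[a[1]] = line
            }
            close(ref)
        }
        {
            split($1, b, ".")
            if (!(b[1] in seen))       print $1, "absent", "-"
            else if (seen[b[1]] == $1) print $1, "exact", seen[b[1]]
            else                       print $1, "other_version", seen[b[1]]
        }' "$1"
}
```

Running `version_check missing_ids.txt gencode_transcript_id.tsv | cut -f 2 | sort | uniq -c` would then count how many of the missing IDs fall into each category.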


Do not waste your time comparing databases. Use one source of data or the other. I say that because things like this can take quite some time, and in the end you'll have to use one of them anyway. I personally use GENCODE because the files are well formatted and parsable (which, to put it carefully, does not always hold true for the RefSeq stuff).

Reply written 11 days ago by ATpoint