I want to assign the peaks/regions I identified in my sequencing experiment to annotated region types (intron, 3' UTR, CDS, exon and intergenic) in genome build hg19. So I downloaded annotation files from both UCSC and GENCODE.
Using the UCSC Table Browser, I selected genome build GRCh37/hg19 (assembly) -- GENCODE Genes V19 (track) -- Basic wgEncodeGencodeBasicV19 (table), and exported BED files for introns plus 0, 3' UTR, CDS and 5' UTR. An example from the 3' UTR BED file looks like this:
chr1 67208778 67210057 ENST00000237247.6_utr3_26_0_chr1_67208779_f 0 +
chr1 67208778 67210768 ENST00000371039.1_utr3_21_0_chr1_67208779_f 0 +
chr1 67208778 67208882 ENST00000371035.3_utr3_21_0_chr1_67208779_f 0 +
However, when I compared those transcript IDs to the GENCODE V19 GTF (evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)) downloaded directly from the GENCODE website (https://www.gencodegenes.org/releases/reference_releases.html), I found that ~3000 transcript IDs from UCSC cannot be found in the GENCODE GTF.
Here is the code I used to find the differences.
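# Collect the unique transcript IDs from the "transcript" records of the GENCODE GTF (only header lines starting with # are skipped)
grep -v '^#' gencode.v19.annotation.gtf | awk -F'\t' '$3=="transcript" {print $9}' | awk 'BEGIN{FS="; "} {sub(/transcript_id /,"",$2); print $2}' | sed 's/"//g' | sort -u > gencode_transcript_id.tsv
# Count the transcript IDs in the UCSC 3' UTR BED file that are absent from the GENCODE list
comm -13 gencode_transcript_id.tsv <(cut -f 4 genecode_v19_3UTR | cut -d "_" -f 1 | sort -u) | wc -l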
# 3164
Some examples of the IDs that were not found:
ENST00000211372.5
ENST00000211377.3
ENST00000211402.6
ENST00000229725.4
ENST00000230221.4
ENST00000230236.4
ENST00000241802.5
ENST00000259726.6
ENST00000259875.7
ENST00000259891.7
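One check that may narrow this down is whether the missing IDs exist in the GTF under a different version suffix (the number after the dot). The sketch below is only a diagnostic under that assumption, not an explanation of the mismatch. The function name version_only_mismatches and the file name missing_ids.txt are hypothetical; missing_ids.txt is assumed to hold the output of the comm -13 command above without the final wc -l.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each ID in the "missing" list, report whether the
# same accession (text before the first ".") occurs in the reference list
# with a different version suffix.
# Output columns: missing ID, versioned ID found in the reference list.
version_only_mismatches() {
    local missing="$1" reference="$2"
    awk -F'.' '
        # First file (reference): remember accession -> full versioned ID.
        FILENAME == ARGV[1] { ref[$1] = $0; next }
        # Second file (missing): same accession, different full ID.
        ($1 in ref) && ref[$1] != $0 { print $0 "\t" ref[$1] }
    ' "$reference" "$missing"
}
```

It could be called as: version_only_mismatches missing_ids.txt gencode_transcript_id.tsv. IDs that produce no output line have no counterpart in the GTF list at all, even when the version is ignored.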
According to the UCSC table schema description, both should be based on the same Ensembl version (v74).
Has anyone experienced this? Where do these transcript IDs come from?
Do not waste your time comparing databases. Use either one source of data or the other. I say that because things like this can take quite some time, and in the end you will have to use one of them anyway. I personally use GENCODE, because the files are well formatted and easy to parse (which, to put it carefully, does not always hold true for the RefSeq stuff).
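If you stay with the GENCODE GTF alone, the region files can be derived from it directly. The following is a minimal sketch, not a complete pipeline: gtf_feature_to_bed is a hypothetical helper that writes BED6 lines for one feature type from column 3 of the GTF, converting the 1-based GTF start to the 0-based BED start. As far as I know, the GENCODE GTF labels 5' and 3' UTRs with the single feature type UTR and has no intron records, so those would need an extra step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print BED6 lines (chrom, start, end, transcript ID,
# score, strand) for one GTF feature type such as CDS, UTR or exon.
gtf_feature_to_bed() {
    local feature="$1" gtf="$2"
    awk -v feat="$feature" 'BEGIN { FS = OFS = "\t" }
        /^#/ { next }                      # skip header lines
        $3 == feat {
            # Pull the quoted transcript_id value out of the attribute column.
            if (match($9, /transcript_id "[^"]+"/)) {
                id = substr($9, RSTART + 15, RLENGTH - 16)
                # GTF is 1-based inclusive; BED is 0-based half-open.
                print $1, $4 - 1, $5, id, 0, $7
            }
        }' "$gtf"
}
```

For example: gtf_feature_to_bed CDS gencode.v19.annotation.gtf > gencode_v19_CDS.bed (the output file name is only an illustration).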