I am a bit confused about the relationship/difference between knownGene and wgEncodeGencodeCompV39 in the UCSC Table Browser. Does anyone know the precise difference between them?
They both can be downloaded from the goldenPath page.
knownGene: The schema is here, but it does NOT match the file (knownGene.txt.gz) I downloaded. According to the page, this is entirely GENCODE V39.
wgEncodeGencodeCompV39: The schema is here, and it matches the file (wgEncodeGencodeCompV39.txt.gz) I downloaded. This one also claims to be from the ALL GENCODE V39 track.
According to the UCSC FAQ page, the ALL GENCODE track actually has less third-party information than GENCODE, which agrees with my comparison: the knownGene table has 21902 more transcripts than the wgEncodeGencodeCompV39 table. For the remaining transcripts that exist in both tables (matched by the name column), the two tables are 100% identical.
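A comparison of this kind can be sketched in Python. This is only an illustration under stated assumptions: `read_names` and `compare_names` are hypothetical helper names, and the position of the name column has to be passed in by hand. It is probably 0 for the knownGene dump and 1 for wgEncodeGencodeCompV39 (which appears to start with a bin column), but that should be checked against the downloaded files or the .sql file that accompanies each dump.

```python
import gzip


def read_names(path, name_col):
    """Collect the values of one column (0-based) from a gzipped, tab-separated table dump."""
    names = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > name_col:
                names.add(fields[name_col])
    return names


def compare_names(first, second):
    """Count names shared by both sets and names unique to each set."""
    return {
        "shared": len(first & second),
        "only_in_first": len(first - second),
        "only_in_second": len(second - first),
    }
```

Calling `compare_names(read_names("knownGene.txt.gz", 0), read_names("wgEncodeGencodeCompV39.txt.gz", 1))` would then report how many transcript names are shared and how many occur in only one of the two tables, assuming those column positions are right.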
Also, on the goldenPath page, I found that knownGene was updated two days after the GENCODE V39 annotation series was released. So I suspect knownGene is just a label for the "latest" version of GENCODE (actually a modified one), and it will be updated every time GENCODE releases a new version. If so, using the knownGene annotation in an analysis will cause reproducibility problems months later, right?
Finally, I thought one of them must originate from the raw GENCODE GTF here. So I downloaded all the human GTF files on this GENCODE page, then compared transcript IDs. I found that neither knownGene nor wgEncodeGencodeCompV39 matches any one of these GTF files 100%, which suggests to me that both the knownGene table and the wgEncodeGencodeCompV39 table are derived from GENCODE, but with some modifications.
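For anyone who wants to repeat the GTF check, here is a minimal sketch of how the transcript IDs could be pulled out of a gzipped GTF and compared with the IDs from a UCSC table. The function names `gtf_transcript_ids` and `match_fraction` are hypothetical, and the sketch assumes the usual GTF layout (nine tab-separated columns, with `transcript_id "..."` in the attribute column). Whether both sources write the IDs with the same version suffix is something to verify before reading much into the result.

```python
import gzip
import re

# Matches the transcript_id attribute in the ninth GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def gtf_transcript_ids(path):
    """Return the transcript_id values found on 'transcript' records of a gzipped GTF."""
    ids = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            # Column 3 is the feature type, column 9 holds the attributes.
            if len(fields) < 9 or fields[2] != "transcript":
                continue
            match = TRANSCRIPT_ID.search(fields[8])
            if match:
                ids.add(match.group(1))
    return ids


def match_fraction(table_ids, gtf_ids):
    """Fraction of the table's transcript IDs that also appear in the GTF."""
    if not table_ids:
        return 0.0
    return len(table_ids & gtf_ids) / len(table_ids)
```

A fraction of 1.0 for some GTF file would mean every transcript in the table is present in that file; anything lower points to IDs that were added, renamed, or filtered somewhere along the way.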
So I have the following questions:
Which one should I use for general genome annotation, knownGene or wgEncodeGencodeCompV39? To avoid reproducibility problems, I want to use wgEncodeGencodeCompV39, but it looks like many packages (like TxDb.Hsapiens.UCSC.hg38.knownGene) use knownGene. My searches never turned up anyone using wgEncodeGencodeCompVXX.
Where do the 21902 extra "third-party information" transcripts in knownGene come from, compared with wgEncodeGencodeCompV39?
Can I find a schema that matches the knownGene table I downloaded?
While UCSC support staff stop by here periodically, you should email this in to genome at soe.ucsc.edu for prompt help.
Thanks for the idea, I will try emailing them directly about the detailed differences. ^_^
But for questions like "Which one should I use for general genome annotation, knownGene or wgEncodeGencodeCompV39?", the community's ideas can be very helpful here.
Annotation from GENCODE is the official release. Sites like UCSC, NCBI and Ensembl add their own annotations. Ensembl's are probably going to be the closest to GENCODE. Not sure what the intended use case is, but you can't go wrong using GENCODE/Ensembl annotations.