I have UCSC data for hg19 that was downloaded about 6 months ago, and now I need GTF files for the same data to use with TopHat. Should I convert the data dump files I have for knownGene and refGene to GTF, or is it safe to re-download them as GTF and assume the files are based on the same version of the annotation?
I tried converting my files to GTF, but without success. My downloaded UCSC annotations look like this:
> ... 585 NR_026818 chr1 - 34610 36081 36081 36081 3 34610,35276,35720, 35174,35481,36081, 0 FAM138A unk unk -1,-1,-1, ...
I tried using awk to make a file that looks like this (for example):
> 1 hg19_refGene exon 11874 12227 . + 0 gene_id "uc001aaa.3";
> 1 hg19_refGene exon 12613 12721 . + 1 gene_id "uc001aaa.3";
> 1 hg19_refGene exon 13221 14409 . + 2 gene_id "uc001aaa.3";
> 1 hg19_refGene exon 11874 12227 . + 0 gene_id "uc010nxq.1";
> 1 hg19_refGene exon 12595 12721 . + 1 gene_id "uc010nxq.1";
However, TopHat reports "Warning: TopHat did not find any junctions in GTF file", so I am evidently not meeting its requirements for GTF annotations, and I don't know what I am missing.
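For concreteness, here is a minimal sketch of the kind of conversion being attempted, written in Python rather than awk. It assumes the standard UCSC refGene column order (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, score, name2, ...), which matches the sample row above. The function name `refgene_row_to_gtf` and the `hg19_refGene` source label are hypothetical. Two differences from the awk output are worth noting, and both are assumptions about what TopHat needs rather than confirmed fixes: the GTF specification expects a `transcript_id` attribute alongside `gene_id` so that exons can be grouped into transcripts, and the chromosome names should probably match those in the Bowtie index (`chr1` here, not `1`). UCSC table starts are 0-based, so 1 is added to each exon start for GTF.

```python
def refgene_row_to_gtf(row, source="hg19_refGene"):
    """Turn one whitespace-separated refGene row into GTF exon lines.

    Assumes the standard UCSC refGene column layout; only the first
    13 columns are used.
    """
    fields = row.split()
    (_bin, name, chrom, strand, _tx_start, _tx_end, _cds_start, _cds_end,
     _exon_count, exon_starts, exon_ends, _score, name2) = fields[:13]

    # exonStarts / exonEnds are comma-separated lists with a trailing comma.
    starts = [int(s) for s in exon_starts.rstrip(",").split(",")]
    ends = [int(e) for e in exon_ends.rstrip(",").split(",")]

    # Both attributes are included; transcript_id groups exons per transcript.
    attributes = f'gene_id "{name2}"; transcript_id "{name}";'

    lines = []
    for start, end in zip(starts, ends):
        lines.append("\t".join([
            chrom, source, "exon",
            str(start + 1),  # UCSC 0-based start -> GTF 1-based start
            str(end),        # end is unchanged (half-open -> inclusive)
            ".", strand, ".",
            attributes,
        ]))
    return lines


if __name__ == "__main__":
    sample = ("585 NR_026818 chr1 - 34610 36081 36081 36081 3 "
              "34610,35276,35720, 35174,35481,36081, 0 FAM138A unk unk -1,-1,-1,")
    for gtf_line in refgene_row_to_gtf(sample):
        print(gtf_line)
```

The sketch emits only `exon` features with tab-separated fields; CDS, start codon, and stop codon records are left out to keep it short.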
Alternatively, if you can tell me whether the RefSeq and knownGene files are left unchanged once they are released for a genome build, I could go back and re-download the files I have in the format I need. Until I know that, however, I am wary of using whatever files are available now, since I'm not confident that the data has not changed.