First, a brief note: I know this topic has been brought up in the past, but I would like some more specific feedback.
When comparing RNA-seq to ChIP-seq data, ideally one would use the same assembly and annotation files for both data types. For any epigenetics work, I've found the UCSC genome and annotation files to be very valuable (considering the wealth of information you can get from tools such as the Table Browser, and their compatibility with several R packages that use UCSC genomes). However, one issue I have with using UCSC for RNA-seq is their GTF. UCSC usually relies more on transcript IDs than gene IDs, and I have noticed that the gene_id field in their GTF files (see below) usually contains the same value as transcript_id. Summarizing the data at the gene level can therefore be problematic if you use functions such as featureCounts in R, where meta-features are defined by the information present in the GTF: each transcript ends up being counted as its own "gene".
gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"
(the above is also present in other tracks such as RefSeq)
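One possible workaround for the gene_id problem above is to rewrite the gene_id attribute before counting. The following is only a minimal Python sketch, assuming you already have a transcript-to-gene lookup from some source (for example, a table exported separately). The function name `fix_gene_ids`, the `tx2gene` dictionary, and the placeholder gene ID `GENE_A` are all hypothetical and used purely for illustration.

```python
import re

# Match the gene_id and transcript_id attributes in GTF column 9.
# \b keeps "gene_id" from matching inside longer keys such as "ref_gene_id".
GENE_ID_RE = re.compile(r'\bgene_id "([^"]+)"')
TRANSCRIPT_ID_RE = re.compile(r'\btranscript_id "([^"]+)"')


def fix_gene_ids(gtf_lines, tx2gene):
    """Yield GTF lines with gene_id replaced by the gene mapped to transcript_id.

    tx2gene is an assumed dict {transcript_id: gene_id}; lines whose
    transcript is not in the mapping are passed through unchanged.
    """
    for line in gtf_lines:
        # Pass comment/header lines through untouched.
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # A well-formed GTF record has exactly 9 tab-separated columns.
        if len(fields) != 9:
            yield line
            continue
        match = TRANSCRIPT_ID_RE.search(fields[8])
        gene = tx2gene.get(match.group(1)) if match else None
        if gene is not None:
            # Use a function as the replacement so the gene ID is inserted literally.
            fields[8] = GENE_ID_RE.sub(
                lambda _m: f'gene_id "{gene}"', fields[8], count=1
            )
        yield "\t".join(fields) + "\n"


if __name__ == "__main__":
    # Hypothetical record and mapping, for illustration only.
    record = (
        "chr1\tucsc\texon\t100\t200\t.\t+\t.\t"
        'gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1";\n'
    )
    mapping = {"ENSMUST00000169927.1": "GENE_A"}
    for fixed in fix_gene_ids([record], mapping):
        print(fixed, end="")
```

The rewritten file could then be given to featureCounts, which groups features by the gene_id attribute by default. Whether the mapping you use is consistent with the annotation used on the ChIP-seq side is a separate question that this sketch does not address.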
This is a non-issue if the Ensembl assembly and annotation file are used for RNA-seq. If I were looking at RNA-seq and ChIP-seq separately, I would have no issue choosing Ensembl for RNA-seq and UCSC for ChIP-seq or any other epigenetics-seq. However, if one wants to compare the results from RNA-seq to ChIP-seq, what is the best approach to resolve the conflicts between the databases? I find that these comparisons are not described in as much detail as they should be in manuscripts ("we aligned our data to the rn6 genome..." but obtained from where, etc.), and I am wondering whether people are actually choosing two different databases and simply using the gene names as the basis for the comparison.
Any advice would help greatly!!