Question: Assembly database choice for RNA-seq and ChIP-seq
12 months ago by
United States
lshepard390 wrote:


First, a brief note: this topic has been brought up in the past, but I would like some more specific feedback.

When comparing RNA-seq to ChIP-seq data, it would ideally be best to use the same assembly and annotation files for both data types. For any epigenetics work, I've found the UCSC genome and annotation files to be very valuable (considering the wealth of information you can get from tools such as the Table Browser, and the compatibility with several R packages that use UCSC genomes). However, one issue I have with using UCSC for RNA-seq is their GTF. UCSC usually relies more on transcript IDs than gene IDs, and I have noticed that the gene_id field in their GTF files (see below) usually contains the same value as the transcript_id field. Summarizing the data at the gene level can therefore be problematic if you use functions such as featureCounts in R, where meta-features are defined by the information present in the GTF.

gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"

(the above is also present in other tracks such as RefSeq)
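For anyone who wants to check their own file, here is a minimal Python sketch that flags this pattern. It assumes the usual GTF attribute syntax (`key "value";` pairs in the ninth column); the function names are made up for illustration and are not part of any package.

```python
import re


def parse_gtf_attributes(attr_field):
    """Parse a GTF attribute string into a dict of key -> value."""
    return dict(re.findall(r'(\w+) "([^"]*)"', attr_field))


def gene_id_is_transcript_id(attr_field):
    """Return True when gene_id is present and identical to transcript_id."""
    attrs = parse_gtf_attributes(attr_field)
    gene_id = attrs.get("gene_id")
    return gene_id is not None and gene_id == attrs.get("transcript_id")


# The attribute string quoted in the question above.
example = 'gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"'
print(gene_id_is_transcript_id(example))
```

If most lines in a GTF are flagged this way, gene-level counting based on gene_id would effectively count per transcript.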

This is a non-issue if the Ensembl assembly and annotation files are used for RNA-seq. If I were looking at RNA-seq and ChIP-seq separately, I would have no issue choosing Ensembl for RNA-seq and UCSC for ChIP-seq or any other epigenetics-seq. However, if one were to compare the results from RNA-seq to ChIP-seq, what is the best approach to resolve the conflicts between the databases? I find that these comparisons are not described in as much detail as they should be in manuscripts ("we aligned our data to the rn6 genome....(but from where etc...)"), and I am wondering whether people are actually choosing two different databases and simply using the gene names as the basis for the comparisons.

Any advice would help greatly!!

rna-seq chip-seq • 448 views
modified 11 months ago • written 12 months ago by lshepard390
12 months ago by
geek_y10k wrote:

Presumably the initial ChIP-Seq analysis is independent of gene annotations; it depends only on the genome version. At the end, you will have a set of peaks.

For RNA-Seq, you could choose UCSC or Ensembl gene annotations. The example you are showing is a bit strange, or an exception; usually that's not the case. You can even consider the more comprehensive annotations from GENCODE.

To integrate ChIP-Seq and RNA-Seq, the genome version used should be the same, and if you are doing any promoter-based analysis on the ChIP-Seq data, get the TSS information from the same GTF file you used for the RNA-Seq analysis. If you used GENCODE, use GENCODE TSS files.
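As a rough illustration of taking the TSS from the same GTF, here is a minimal Python sketch. It assumes the GTF has `transcript` feature rows (Ensembl and GENCODE files do; a file that only has exon/CDS rows would need the transcript bounds computed from its exons first). The function name and the window sizes are arbitrary choices for the example.

```python
def promoter_window(gtf_line, upstream=1000, downstream=100):
    """Return (chrom, start, end, strand) around the TSS of a transcript row.

    Coordinates are 1-based and inclusive, as in GTF. Returns None for
    non-transcript rows or rows without a defined strand.
    """
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "transcript":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    if strand == "+":
        tss = start  # transcription starts at the leftmost base
        lo, hi = tss - upstream, tss + downstream
    elif strand == "-":
        tss = end  # transcription starts at the rightmost base
        lo, hi = tss - downstream, tss + upstream
    else:
        return None
    return chrom, max(1, lo), hi, strand


# Made-up rows, for illustration only.
plus_row = 'chr1\tsrc\ttranscript\t5000\t9000\t.\t+\t.\tgene_id "g1"; transcript_id "t1";'
minus_row = 'chr1\tsrc\ttranscript\t5000\t9000\t.\t-\t.\tgene_id "g2"; transcript_id "t2";'
print(promoter_window(plus_row))
print(promoter_window(minus_row))
```

The resulting windows can then be intersected with the ChIP-Seq peaks, so that peaks and counts refer to the same transcript models.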

written 12 months ago by geek_y10k

To clarify though: while the genome version should indeed be the same, both Ensembl (I focus on Ensembl because GENCODE covers only human and mouse) and UCSC distribute their own genome files. How much discrepancy is there between these genomes when they are the same version, such as mm10 or rn6? I am aware of the chromosome name differences and of potential differences in the extra sequences at the end of the FASTA files, but if everything else is the same, then I don't see the harm in aligning each data type to the genome from the database whose annotation I want to use.

I ask because the GTF from UCSC still doesn't make sense to me for RNA-seq (because the IDs are repeated). So if alignment were done with Ensembl for RNA-seq and with UCSC for ChIP-seq, annotated with the Ensembl track at UCSC, where promoter, intron, and exon coordinates can be easily extracted, would this be an issue? Thanks for the clarification.
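To make the chromosome naming difference concrete, here is a small Python sketch that converts Ensembl-style names to UCSC-style names for the assembled chromosomes. It assumes a mouse/rat-like karyotype (numbered chromosomes plus X, Y, and MT); unplaced scaffolds follow different naming rules in each database and would need a lookup table, so the helper (a made-up name) returns None for them rather than guessing.

```python
def ensembl_to_ucsc(name):
    """Map an Ensembl sequence name to its UCSC equivalent, or None if unknown."""
    if name == "MT":
        return "chrM"  # the mitochondrial genome is MT in Ensembl, chrM in UCSC
    if name.isdigit() or name in {"X", "Y"}:
        return "chr" + name
    return None  # scaffolds and patches: use an explicit mapping table instead


for ensembl_name in ["1", "X", "MT"]:
    print(ensembl_name, "->", ensembl_to_ucsc(ensembl_name))
```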

modified 12 months ago • written 12 months ago by lshepard390

To add to my comment, I wanted to point out that I have indeed seen previous BioStar posts discussing this, such as this one

But in the end, there wasn't a conclusion about how to deal with the transcript variants for RNA-seq gene-level analysis, which is relevant to my original question if I wanted to use UCSC for both analyses.

modified 12 months ago • written 12 months ago by lshepard390
11 months ago by
United States
lshepard390 wrote:

Posting this solution to my question in case it helps anyone else, since the current GTF output from UCSC is a bit confusing (I am not sure why they still keep the GTF output as an option, given the workaround below, but oh well):

This is the solution I found to ensure that the UCSC GTF contains distinct gene_id and transcript_id fields, which leads to correct gene-level summarization for RNA-seq (and thus lets you use the same resources for both RNA-seq and ChIP-seq, or any epigenetics data). Use the genePredToGtf tool described here. And of course, if you are using UCSC, you can still choose the Ensembl/GENCODE tracks if desired.

modified 11 months ago • written 11 months ago by lshepard390

You could always download the mapping between Ensembl transcript IDs and gene IDs (e.g. via biomaRt) and set the record straight within R, even if you started off with the UCSC "GTF" file.
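The same idea can be sketched outside R. Here is a minimal Python version, assuming you have already exported a transcript-to-gene table (from biomaRt or elsewhere) and loaded it into a dict keyed by unversioned transcript ID. The function name and the gene ID in the example are placeholders, not real annotation values.

```python
import re


def fix_gene_id(gtf_line, tx_to_gene):
    """Replace gene_id with the gene mapped to this line's transcript_id.

    Tries the transcript ID as written, then without its version suffix.
    Lines with no transcript_id or no mapping are returned unchanged.
    """
    match = re.search(r'transcript_id "([^"]+)"', gtf_line)
    if match is None:
        return gtf_line
    tx = match.group(1)
    gene = tx_to_gene.get(tx) or tx_to_gene.get(tx.split(".")[0])
    if gene is None:
        return gtf_line
    # A function replacement avoids backslash escapes being interpreted.
    return re.sub(r'gene_id "[^"]*"', lambda _: 'gene_id "%s"' % gene, gtf_line, count=1)


# Placeholder mapping: the gene ID here is NOT the real gene for this transcript.
mapping = {"ENSMUST00000169927": "GENE_PLACEHOLDER"}
attrs = 'gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"'
print(fix_gene_id(attrs, mapping))
```

Applied line by line to the GTF, this gives featureCounts a gene_id it can actually group transcripts by.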

written 11 months ago by Friederike5.2k

Yes, that thought crossed my mind, but the link above also mentions other concerns about the GTF from the Table Browser, so I think the genePredToGtf option may be a quicker way to resolve these as well:

  • The Table Browser adds start and stop codon annotations whether or not the transcript alignment includes proper start and stop codons.
  • Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here.
  • Issue with stop codons in GTF output from Table Browser
written 11 months ago by lshepard390