Assembly database choice for RNA-seq and ChIP-seq
5.2 years ago
lshepard ▴ 470

Hello,

First, a brief note: this topic has been brought up before, but I would like some more specific feedback.

When comparing RNA-seq to ChIP-seq data, ideally one would use the same assembly and annotation files for both. For epigenetics work, I've found the UCSC genome and annotation files very valuable (considering the wealth of information available from tools such as the Table Browser, and the compatibility with several R packages that use UCSC genomes). However, one issue I have with using UCSC for RNA-seq is their GTF. UCSC usually relies more on transcript IDs than gene IDs, and I have noticed that the gene_id field in their GTF files (see below) usually just repeats the transcript ID. Summarizing the data at the gene level is therefore problematic if you use functions such as featureCounts in R, where meta-features are defined by the information present in the GTF.

gene_id "ENSMUST00000169927.1"; transcript_id "ENSMUST00000169927.1"

(the above is also present in other tracks such as RefSeq)
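For anyone who wants to check how widespread this is in their own file, here is a minimal Python sketch. It assumes a standard 9-column, tab-separated GTF with `key "value";` attributes; the function names are my own and not part of any package.

```python
import re

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def fraction_gene_equals_transcript(gtf_lines):
    """Fraction of records whose gene_id is identical to their transcript_id."""
    total = 0
    same = 0
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "gene_id" in attrs and "transcript_id" in attrs:
            total += 1
            if attrs["gene_id"] == attrs["transcript_id"]:
                same += 1
    return same / total if total else 0.0
```

Passing an open file handle (e.g. `fraction_gene_equals_transcript(open("my.gtf"))`, where `my.gtf` is a placeholder path) returns a value near 1.0 when the file has the problem described above.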

This is a non-issue if the Ensembl assembly and annotation file is used for RNA-seq. If I were looking at RNA-seq and ChIP-seq separately, I would have no issues choosing Ensembl for RNA-seq and UCSC for ChIP-seq or any other epigenetics assay. However, if one were to compare the results from RNA-seq to ChIP-seq, what is the best approach to resolve the conflicts among the databases? I find that these comparisons are not described in as much detail as they should be in manuscripts ("we aligned our data to the rn6 genome....(but from where etc...)"), and I am wondering if people are actually choosing two different databases and simply using the gene names as the basis for the comparisons.

Any advice would help greatly!!

5.2 years ago

Presumably initial ChIP-Seq analysis is independent of gene annotations. It only depends on genome version. At the end, you will have certain peaks.

For RNA-Seq, you could choose UCSC or Ensembl gene annotations. The example you are showing looks like an exception; usually that's not the case. You can also consider the more comprehensive annotations from GENCODE.

To integrate ChIP-Seq and RNA-Seq, the genome version used should be the same, and if you are doing any promoter-based analysis on ChIP-Seq data, get the TSS information from the same GTF file you used for the RNA-Seq analysis. If you used GENCODE, use GENCODE TSS files.
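As a rough illustration of pulling TSS positions from the same GTF used for RNA-Seq, here is a Python sketch. It assumes the GTF has `exon` records with a `transcript_id` attribute and a strand of `+` or `-`; the function names and the default window sizes (1000 bp upstream, 100 bp downstream) are arbitrary choices for the example, not a recommendation.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def tss_by_transcript(gtf_lines):
    """Return {transcript_id: (chrom, tss, strand)} from exon records.

    GTF coordinates are 1-based and inclusive. The TSS is taken as the
    smallest exon start on the + strand and the largest exon end on the
    - strand.
    """
    spans = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if len(f) < 9 or f[2] != "exon":
            continue
        tid = dict(ATTR_RE.findall(f[8])).get("transcript_id")
        if tid is None:
            continue
        start, end = int(f[3]), int(f[4])
        chrom, strand, lo, hi = spans.get(tid, (f[0], f[6], start, end))
        spans[tid] = (chrom, strand, min(lo, start), max(hi, end))
    return {
        tid: (chrom, lo if strand == "+" else hi, strand)
        for tid, (chrom, strand, lo, hi) in spans.items()
    }


def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Strand-aware promoter interval (1-based, inclusive) around a TSS."""
    if strand == "+":
        return max(1, tss - upstream), tss + downstream
    return max(1, tss - downstream), tss + upstream
```

The resulting windows can then be intersected with the ChIP-Seq peaks, so that both data types refer to the same transcript models.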


To clarify though: while the version should indeed be the same, both Ensembl (I focus on Ensembl, because Gencode is only for humans and mice) and UCSC have their own genome files. How much discrepancy is there between these genomes (they would still be the same version, such as mm10 or rn6)? I am aware of the chromosome name differences, and of potential differences in the extra sequences at the end of the FASTA files, but if everything else is the same then I don't see the harm in taking each genome from the database whose annotation I want to use.

I ask because the GTF from UCSC still doesn't make sense to me for RNA-seq (in terms of having the IDs repeated). So if alignment were to be done with Ensembl for RNA-seq and with UCSC for ChIP-seq, annotated with the Ensembl track at UCSC, where promoter, intron and exon coordinates can be easily extracted, would this be an issue? Thanks for the clarification.
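On the chromosome name difference specifically, a small Python sketch of the usual conversion is below. It only handles the primary chromosomes (UCSC `chr1` to Ensembl `1`, `chrM` to `MT`) and deliberately returns `None` for anything else, because unplaced and random scaffolds are named differently in the two resources and I am not assuming any rule for them. The function name is my own.

```python
def ucsc_to_ensembl(name):
    """Map a UCSC-style primary chromosome name to Ensembl style.

    Returns None for names that are not simple primary chromosomes
    (e.g. unplaced or random scaffolds), which need an explicit
    lookup table rather than a string rule.
    """
    if name == "chrM":
        return "MT"  # mitochondrial genome is named differently
    if name.startswith("chr"):
        rest = name[3:]
        if rest.isdigit() or rest in ("X", "Y"):
            return rest
    return None
```

Peaks called on a UCSC genome can be renamed this way before comparing them with Ensembl-based gene coordinates; records that map to `None` should be checked by hand or dropped.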


To complement my comment, I wanted to point out that I have indeed seen previous Biostars posts discussing this, such as this one

But in the end, there was no conclusion about how to deal with the transcript variants for RNA-seq gene-level analysis, which is relevant to my original question if I want to use UCSC for both analyses.

5.2 years ago
lshepard ▴ 470

Posting this solution to my question in case it helps anyone else, since the current GTF output from UCSC is a bit confusing (I am not sure why they still keep the GTF output as an option, given the workaround below, but oh well):

This is the solution I found to ensure that UCSC GTF contains distinct gene_id vs transcript_id fields which would lead to the correct gene-level summarization for RNA-seq (thus, using the same resources for both RNA/ChIP or any epigenetics data). Use the genePredToGtf tool described here. And of course, if you are using UCSC, you can still choose the Ensembl/Gencode tracks if desired.


You could always download the mappings of ENSEMBL transcript IDs and gene IDs (e.g. via biomaRt) and set the record straight within R even if you started off with the UCSC "GTF" file.
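The same idea can also be applied directly to the GTF text before counting. Below is a minimal Python sketch, assuming you already have a transcript-to-gene dictionary (for example exported from BioMart); the function name is my own, and stripping the version suffix (`.1`) for the lookup is an assumption about how the mapping is keyed.

```python
import re

GENE_ID_RE = re.compile(r'gene_id "[^"]*"')
TX_ID_RE = re.compile(r'transcript_id "([^"]*)"')


def fix_gene_id(line, tx2gene):
    """Rewrite gene_id on one GTF line using a transcript -> gene mapping.

    Lines without a transcript_id, or whose transcript is not in the
    mapping, are returned unchanged.
    """
    m = TX_ID_RE.search(line)
    if m is None:
        return line
    tx = m.group(1)
    # Try the ID as written, then without its version suffix (assumption).
    gene = tx2gene.get(tx) or tx2gene.get(tx.split(".")[0])
    if gene is None:
        return line
    # A function replacement avoids any escape handling in the gene ID.
    return GENE_ID_RE.sub(lambda _: 'gene_id "%s"' % gene, line, count=1)
```

Running every line of the UCSC file through `fix_gene_id` gives a GTF in which featureCounts can group transcripts by gene.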


Yes, that thought came to my mind, but they also mention other concerns about the GTF from Table Browser in the link above, so I think the option above may be a quicker way to resolve these other concerns:

  • The Table Browser adds start and stop codon annotations whether or not the transcript alignment includes proper start and stop codons.
  • Tables not in genePred format (e.g., knownCanonical) will produce unexpected GTF output, in addition to the other "known-limitations for Table Browser GTF output" listed here.
  • Issue with stop codons in GTF output from Table Browser
