Question: Aligning RNA-seq and ATAC-seq gene annotations
0
gravatar for Lucy
7 months ago by
Lucy0
Lucy0 wrote:

Hi,

I have matched RNA-seq and ATAC-seq data from two cell subsets and I am trying to compare the expressed genes from the RNA-seq data with the accessible peaks from the ATAC-seq data. However, I have found that the methods which I used to annotate the expressed genes and accessible peaks use different databases/approaches, and therefore the annotations in the two datasets show large discrepancies. For example, in some cases, genes and peaks at the same location within the genome are assigned to different overlapping loci.

To assign genes to features, I used featureCounts with a hg38 GTF file downloaded from Ensembl. Therefore, the output is a list of read counts assigned to Ensembl IDs for each sample, which I then mapped to Entrez IDs and HGNC gene symbols using BioMart. In contrast, for peak annotation, I used Homer annotatePeaks.pl (http://homer.ucsd.edu/homer/ngs/annotation.html) to assign peaks to the nearest gene, which uses the RefSeq transcription start sites to assign peaks to the nearest gene.

To try and fix this problem, I ran Homer custom annotation using the same GTF file that I used for my RNA-seq data. However, Homer annotation seems to take the transcript_id from the GTF file instead of the gene_id, which is used for featureCounts. I tried swapping the Ensembl IDs into the transcript_id position of the GTF file, however the Ensembl IDs assigned to the peaks still poorly match my RNA-seq gene annotations.

When I run the Homer annotation, it also says “Features that will be considered: exon”. I am unclear as to why Homer would only use the exon regions from the GTF file for annotation.

Could anyone suggest a way in which I can align the gene annotations of my RNA-seq and ATAC-seq data?

Many thanks for the advice.

Best wishes,

Lucy

ADD COMMENTlink modified 7 months ago • written 7 months ago by Lucy0

Can you clarify what you mean by "peaks still match poorly?".

Do you mean that many of the peaks get assigned IDs that arn't counted in your RNA-seq dataset? Or that there are many genes in the RNA-seq dataset that have no peaks associated with them? Or that things that change in the RNASeq do not have peaks close to them?

ADD REPLYlink written 7 months ago by i.sudbery5.2k

Only 26465/66908 peaks have Ensembl IDs that match those in the RNA-seq data. However, when I used the default Homer annotation, 40709/66908 peaks had Ensembl IDs that match those in the RNA-seq file. There are still problems with the default annotation though. For example in some cases, genes within a particular locus will be assigned to two nearby Ensembl IDs, whereas in Homer, peaks at both of these genes would be assigned to the same Ensembl ID.

ADD REPLYlink written 7 months ago by Lucy0

If you are using the same GTF file, then all ENSEMBL IDs should have an entry in the RNA-seq data. The number of counts may be zero, but their should be an entry in the table. Can you provide an example of an ID that is present in the homer output, but not in the featureCounts output?

ADD REPLYlink written 7 months ago by i.sudbery5.2k

Its a bit hard to follow but can you let me know your goal with RNA and ATAC ? and the genome build you are interested in ?

ADD REPLYlink written 7 months ago by geek_y9.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 725 users visited in the last hour