I have matched RNA-seq and ATAC-seq data from two cell subsets and I am trying to compare the expressed genes from the RNA-seq data with the accessible peaks from the ATAC-seq data. However, I have found that the methods which I used to annotate the expressed genes and accessible peaks use different databases/approaches, and therefore the annotations in the two datasets show large discrepancies. For example, in some cases, genes and peaks at the same location within the genome are assigned to different overlapping loci.
To assign genes to features, I used featureCounts with a hg38 GTF file downloaded from Ensembl. Therefore, the output is a list of read counts assigned to Ensembl IDs for each sample, which I then mapped to Entrez IDs and HGNC gene symbols using BioMart. In contrast, for peak annotation, I used Homer annotatePeaks.pl (http://homer.ucsd.edu/homer/ngs/annotation.html) to assign peaks to the nearest gene, which uses the RefSeq transcription start sites to assign peaks to the nearest gene.
To try and fix this problem, I ran Homer custom annotation using the same GTF file that I used for my RNA-seq data. However, Homer annotation seems to take the transcript_id from the GTF file instead of the gene_id, which is used for featureCounts. I tried swapping the Ensembl IDs into the transcript_id position of the GTF file, however the Ensembl IDs assigned to the peaks still poorly match my RNA-seq gene annotations.
When I run the Homer annotation, it also says “Features that will be considered: exon”. I am unclear as to why Homer would only use the exon regions from the GTF file for annotation.
Could anyone suggest a way in which I can align the gene annotations of my RNA-seq and ATAC-seq data?
Many thanks for the advice.