I'm performing an RNA-seq analysis. I'm making a count matrix using bam files and a gtf file to make a count matrix.
I wanted to compare my results using a gtf file from the hg19 (ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/) build of the genome and the hg38 (ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/) build of the genome.
I used GenomicFeatures and GenomicAlignments in R to create a count matrix of raw reads using the following code.
filenames=list.files() bamfiles <- BamFileList(filenames, yieldSize=2000000) se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=TRUE) counts=assay(se)
My count matrix is completely different depending on which gtf file I use.
When I plot the statistics from DSS or DESeq results in the downstream analysis based on the gene id (i.e. ENSG00000223972), I get no correlation.
Why would I get completely different count matrices and results when I use different gtf files?
Can someone point me to good gtf files for hg19 and hg38 builds of the genome?