GTF file problem
0
0
Entering edit mode
3.7 years ago

I'm performing an RNA-seq analysis. I'm making a count matrix using bam files and a gtf file to make a count matrix.

I wanted to compare my results using a gtf file from the hg19 (ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/) build of the genome and the hg38 (ftp://ftp.ensembl.org/pub/release-84/gtf/homo_sapiens/) build of the genome.

I used GenomicFeatures and GenomicAlignments in R to create a count matrix of raw reads using the following code.

filenames=list.files() bamfiles <- BamFileList(filenames, yieldSize=2000000) se <- summarizeOverlaps(features=ebg, reads=bamfiles, mode="Union", singleEnd=TRUE, ignore.strand=TRUE) counts=assay(se)

My count matrix is completely different depending on which gtf file I use.

When I plot the statistics from DSS or DESeq results in the downstream analysis based on the gene id (i.e. ENSG00000223972), I get no correlation.

Why would I get completely different count matrices and results when I use different gtf files?

Can someone point me to good gtf files for hg19 and hg38 builds of the genome?

gtf RNA-Seq hg19 hg38 • 1.5k views
ADD COMMENT
1
Entering edit mode

No wonder. You can't swap GTF files alone. You will need to realign the data against the other genome if you want to get the right results. Your chromosome names also need to match so make sure you use compatible genome builds and GTF files that go with them. You can find bundled sequence/annotation/indexes at iGenomes site.

Use featureCounts to do your counting.

ADD REPLY
0
Entering edit mode

I think I spotted your problem right here: you are using a GTF file.

ADD REPLY

Login before adding your answer.

Traffic: 1749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6