Difference between GTF file with CHR and without CHR. ENSEMBL
4.5 years ago
rf ▴ 60

I am very new to RNA-Seq. I am trying to align my samples with STAR, and I am generating the genome index myself because I want to add the spike-in sequence to the GTF and FASTA files.

I am downloading the GTF file from here: ftp://ftp.ensembl.org/pub/release-86/gtf/mus_musculus/

There are two GTF files, one with CHR in the file name and one without. Which one should I use, and how do they differ? I could not figure this out just by opening the files.

Thank you,

If you are not sure either, could you please let me know which one you use for your analysis?

The one with CHR seems to have more lines and it seems to be scaffold genes. Am I missing something?

RNA-Seq STAR ensembl • 7.6k views

They're the same, except that one doesn't have the prefix. I often use the one without the 'chr' prefix, and when references need to be mixed with those downloaded from the UCSC genome browser, I remove the prefix manually.


thank you very much


I don't think this is accurate. OP was referring to the presence or absence of chr in the file name, not the contig names. Please see A: Difference between GTF file with CHR and without CHR. ENSEMBL

4.1 years ago
mt1022 ▴ 250

The file without 'chr' in its name additionally contains annotations for genes on unplaced or unlocalized contigs, while the file with 'chr' in its name contains annotations only for assembled chromosomes. Neither file uses a 'chr' prefix in the chromosome names. See this example:

zcat Danio_rerio.GRCz10.87.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' 
KN150307.1 9
KN150451.1 7
KN150002.1 3
KN149765.1 3
KN149909.1 15
KN149998.1 13
#!genebuild-last-updated 1
KN150027.1 10
KN149917.1 7
KN150188.1 24
...
13 43069
20 48282
KN150399.1 11
KN150221.1 10
KN150670.1 32
KN149696.1 225
21 40566
...


zcat Danio_rerio.GRCz10.87.chr.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' | head
#!genebuild-last-updated 1
MT 147
#!genome-date 1
10 40589
11 40971
12 39371
13 43069
20 48282
21 40566
14 35910
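
The counts above include the `#!` header lines, and the two listings have to be compared by eye. As a minimal sketch (not part of the original answer), the comparison could be automated as below. The helper names `seqnames` and `only_in_full` are hypothetical, and the file names in the usage comment are simply the ones from the example above, assumed to be in the current directory.

```shell
# Hypothetical helpers; the names are illustrative only.

# Print the sorted, unique sequence names (column 1) of a gzipped GTF,
# skipping header lines that start with '#'.
seqnames() {
    gzip -cd "$1" | awk -F'\t' '!/^#/ { print $1 }' | sort -u
}

# Print sequence names that occur in the first GTF but not in the second.
only_in_full() {
    oif_a=$(mktemp)
    oif_b=$(mktemp)
    seqnames "$1" > "$oif_a"
    seqnames "$2" > "$oif_b"
    comm -23 "$oif_a" "$oif_b"   # lines unique to the first file
    rm -f "$oif_a" "$oif_b"
}

# Usage:
#   only_in_full Danio_rerio.GRCz10.87.gtf.gz Danio_rerio.GRCz10.87.chr.gtf.gz
```

If the explanation above holds, the output should list only contig names such as the KN... entries and no assembled chromosomes.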

This answer is correct (official Ensembl response).


So which one should be used?

I am carrying out analysis in Homo sapiens and am doing a differential expression analysis of RNA-seq data.


In my opinion, using either one is OK for a normal DE analysis. However, if you do not want to lose information about any annotated gene, use the one without 'chr'. (And sorry for the late response.)
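
To gauge how many annotated genes would be dropped by choosing the 'chr' file, one could count the gene records in each file. This is a rough sketch under two assumptions: the GTF has `gene` records in column 3 (worth checking for the release in use), and both files are gzipped. `count_genes` is a hypothetical helper name, and `full.gtf.gz` and `chr.gtf.gz` are placeholders for the two downloaded files.

```shell
# Hypothetical helper; the name is illustrative only.

# Count records whose feature type (column 3) is "gene" in a gzipped GTF,
# ignoring header lines that start with '#'.
count_genes() {
    gzip -cd "$1" | awk -F'\t' '!/^#/ && $3 == "gene" { n++ } END { print n + 0 }'
}

# Usage (placeholder file names):
#   echo "genes in full file: $(count_genes full.gtf.gz)"
#   echo "genes in chr file:  $(count_genes chr.gtf.gz)"
```

The difference between the two numbers is the number of genes that sit on sequences other than the assembled chromosomes.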


Is it on purpose that this difference is not explained in the README? (at least for Homo sapiens 102)

4.5 years ago
mastal511 ★ 2.1k

Use the file that names the chromosomes in the same way as they are named in your genome.fasta file, otherwise you will have problems.
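
Whichever GTF is chosen, a name-consistency check of this kind could be sketched as follows. It assumes uncompressed inputs and takes the FASTA sequence name to be the header text after `>` up to the first whitespace. `missing_in_fasta` is a hypothetical helper name, and `annotation.gtf` and `genome.fasta` are placeholder file names.

```shell
# Hypothetical helper; the name is illustrative only.

# Print each GTF sequence name (column 1) that has no matching FASTA header.
# Arguments: <uncompressed GTF> <uncompressed FASTA>
missing_in_fasta() {
    awk -F'\t' '
        # First input (the FASTA): remember every sequence name.
        FILENAME == ARGV[1] {
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)   # drop the header description
                seen[name] = 1
            }
            next
        }
        # Second input (the GTF): skip headers and blank lines.
        /^#/ || NF == 0 { next }
        # Report each unknown sequence name once.
        !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$2" "$1"
}

# Usage (placeholder file names):
#   missing_in_fasta annotation.gtf genome.fasta
```

Empty output would mean that every sequence name used in the GTF also exists in the FASTA file.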


This answer is wrong (official Ensembl response).


Thank you very much
