Question

Difference between GTF file with CHR and without CHR. ENSEMBL

7

7.5 years ago

rf ▴ 60

I am very new to RNA-Seq. I am trying to align my samples with STAR. I am generating the genome index myself because I want to add the spike-in sequences to the GTF and FASTA files.

I am downloading the GTF file from here: ftp://ftp.ensembl.org/pub/release-86/gtf/mus_musculus/

There are two GTF files, one with CHR in the file name and one without. Which one should I use, and how do they differ? I could not work this out just by opening the files.

Thank you,

If you are not sure either, could you please let me know which one you use for your analysis?

The one with CHR seems to have more lines and it seems to be scaffold genes. Am I missing something?

RNA-Seq STAR ensembl • 14k views

ADD COMMENT • link updated 7.1 years ago by Emily 23k • written 7.5 years ago by rf ▴ 60

3

They're the same; one just doesn't have the prefix. I usually use the one without the 'chr' prefix, and when references need to be mixed with those downloaded from the UCSC genome browser, I remove the prefix manually.

ADD REPLY • link 7.5 years ago by Eric Lim ★ 2.1k

1

I don't think this is accurate. OP was referring to the presence or absence of chr in the file name, not the contig names. Please see A: Difference between GTF file with CHR and without CHR. ENSEMBL

ADD REPLY • link 4.8 years ago by Ram 43k

0

thank you very much

ADD REPLY • link 7.5 years ago by rf ▴ 60

2

7.5 years ago

mastal511 ★ 2.1k

Use the file that names the chromosomes in the same way as they are named in your genome.fasta file, otherwise you will have problems.
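
One way to check this before building the index is to compare the sequence names in the two files. Below is a rough sketch, assuming an uncompressed FASTA and GTF. The function name `gtf_names_missing_from_fasta` and the file names in the example call are placeholders, not anything from this thread.

```shell
# Print GTF sequence names (column 1) that have no matching FASTA header.
# FASTA names are taken as the first word after '>'.
# The function name is a hypothetical helper for illustration.
gtf_names_missing_from_fasta() {
    fasta=$1
    gtf=$2
    fa_names=${TMPDIR:-/tmp}/fa_names.$$
    gtf_names=${TMPDIR:-/tmp}/gtf_names.$$

    # Sequence names from the FASTA headers.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | LC_ALL=C sort -u > "$fa_names"
    # Sequence names from the GTF, skipping the '#' header lines.
    grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$gtf_names"

    # Keep only the names that occur in the GTF but not in the FASTA.
    LC_ALL=C comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}

# Example call (placeholder file names):
# gtf_names_missing_from_fasta genome.fasta annotation.gtf
```

If this prints nothing, every sequence name in the GTF also exists in the FASTA.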

ADD COMMENT • link 7.5 years ago by mastal511 ★ 2.1k

3

This answer is wrong (official Ensembl response).

ADD REPLY • link 7.1 years ago by Emily 23k

0

Thank you very much

ADD REPLY • link 7.5 years ago by rf ▴ 60

score 24 · Accepted Answer · 2017-03-13

24

7.1 years ago

mt1022 ▴ 310

The file without 'chr' in its name also contains annotations for genes on unplaced or unlocalized contigs, while the file with 'chr' in its name contains annotations only for assembled chromosomes. Neither file uses a 'chr' prefix in the chromosome names. See this example:

zcat Danio_rerio.GRCz10.87.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' 
KN150307.1 9
KN150451.1 7
KN150002.1 3
KN149765.1 3
KN149909.1 15
KN149998.1 13
#!genebuild-last-updated 1
KN150027.1 10
KN149917.1 7
KN150188.1 24
...
13 43069
20 48282
KN150399.1 11
KN150221.1 10
KN150670.1 32
KN149696.1 225
21 40566
...


zcat Danio_rerio.GRCz10.87.chr.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' | head
#!genebuild-last-updated 1
MT 147
#!genome-date 1
10 40589
11 40971
12 39371
13 43069
20 48282
21 40566
14 35910
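
To see the difference directly, one can list the sequence names that appear only in the full file. This is a minimal sketch, assuming both files from the example above are in the current directory. The helper names `seqnames` and `only_in_first` are made up for illustration, and `gzip -dc` is used in place of `zcat` for portability. Header lines starting with `#` are skipped so they are not counted as sequence names.

```shell
# Print the unique sequence names (column 1) of a gzipped GTF,
# skipping the '#!' header lines. Hypothetical helper.
seqnames() {
    gzip -dc "$1" | grep -v '^#' | cut -f1 | LC_ALL=C sort -u
}

# Print names that occur in the first GTF but not in the second.
# Hypothetical helper.
only_in_first() {
    first_names=${TMPDIR:-/tmp}/first_names.$$
    second_names=${TMPDIR:-/tmp}/second_names.$$
    seqnames "$1" > "$first_names"
    seqnames "$2" > "$second_names"
    LC_ALL=C comm -23 "$first_names" "$second_names"
    rm -f "$first_names" "$second_names"
}

# Example call with the file names used above:
# only_in_first Danio_rerio.GRCz10.87.gtf.gz Danio_rerio.GRCz10.87.chr.gtf.gz
```

On the zebrafish files above, this should list the KN... contigs and none of the numbered chromosomes.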

ADD COMMENT • link 7.1 years ago by mt1022 ▴ 310

7

This answer is correct (official Ensembl response).

ADD REPLY • link 7.1 years ago by Emily 23k

0

So which one should be used?

I am carrying out analysis in Homo sapiens and am doing a differential expression analysis of RNA-seq data.

ADD REPLY • link 6.2 years ago by amina.mcdiarmid ▴ 10

0

In my opinion, either one is fine for a standard DE analysis. However, if you do not want to lose information about any annotated gene, use the one without 'chr'. (And sorry for the late response.)
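
To get a feel for how much would be lost, one could count the gene records that sit on sequences absent from the 'chr' file. This is only a sketch: it assumes the third GTF column uses the feature type `gene`, and `count_extra_genes` is a made-up helper name.

```shell
# Count 'gene' records in the full GTF ($1) whose sequence name does not
# occur in the chromosome-only GTF ($2). Both files are gzipped.
# Hypothetical helper; assumes gene rows have "gene" in column 3.
count_extra_genes() {
    chr_names=${TMPDIR:-/tmp}/chr_names.$$
    gzip -dc "$2" | grep -v '^#' | cut -f1 | LC_ALL=C sort -u > "$chr_names"

    gzip -dc "$1" | awk -F '\t' -v names="$chr_names" '
        FILENAME == names { on_chr[$1] = 1; next }  # names seen in the chr file
        /^#/ { next }                               # skip header lines
        $3 == "gene" && !($1 in on_chr) { n++ }     # gene on another sequence
        END { print n + 0 }
    ' "$chr_names" -

    rm -f "$chr_names"
}

# Example call (file names from the accepted answer):
# count_extra_genes Danio_rerio.GRCz10.87.gtf.gz Danio_rerio.GRCz10.87.chr.gtf.gz
```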

ADD REPLY • link 6.2 years ago by mt1022 ▴ 310

0

Is it intentional that this difference is not explained in the README (at least for Homo sapiens 102)?

ADD REPLY • link 3.4 years ago by kathka ▴ 30