Difference between GTF file with CHR and without CHR. ENSEMBL
2
7
6.3 years ago
rf ▴ 60

I am very new to RNA-Seq. I am trying to align my samples with STAR, and I am generating the genome index myself because I want to add the spike-in sequences to the GTF and FASTA files.

There are two GTF files, one with CHR in the file name and one without. Which one should I use, and how do they differ? I have not been able to figure this out just by opening the files.

Thank you,

If you are not sure either, would you please let me know which one you use for your analysis?

The one with CHR seems to have more lines and it seems to be scaffold genes. Am I missing something?

RNA-Seq STAR ensembl • 11k views
3

They're the same, just one doesn't have the prefix. I often use the one without the 'chr' prefix, and when references need to be mixed with those downloaded from the UCSC genome browser, I remove the prefix manually.

1

I don't think this is accurate. OP was referring to the presence or absence of chr in the file name, not the contig names. Please see A: Difference between GTF file with CHR and without CHR. ENSEMBL

0

thank you very much

22
5.9 years ago
mt1022 ▴ 290

The file without 'chr' in its name also contains annotations for genes on unplaced or unlocalized contigs, while the file with 'chr' in its name contains annotations only for the assembled chromosomes. Neither file uses a 'chr' prefix in the chromosome names. See this example:

zcat Danio_rerio.GRCz10.87.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}'
KN150307.1 9
KN150451.1 7
KN150002.1 3
KN149765.1 3
KN149909.1 15
KN149998.1 13
#!genebuild-last-updated 1
KN150027.1 10
KN149917.1 7
KN150188.1 24
...
13 43069
20 48282
KN150399.1 11
KN150221.1 10
KN150670.1 32
KN149696.1 225
21 40566
...

zcat Danio_rerio.GRCz10.87.chr.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' | head
#!genebuild-last-updated 1
MT 147
#!genome-date 1
10 40589
11 40971
12 39371
13 43069
20 48282
21 40566
14 35910
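
To see the difference directly rather than scanning the counts, one option is to list the sequence names that occur in one file but not in the other. This is only a minimal sketch: the function names gtf_seqnames and gtf_only_in_first are made up for illustration, and it assumes gzipped, tab-separated GTF files whose header lines start with '#'. It uses gzip -dc in place of zcat for portability.

```shell
# List the unique sequence names (column 1) of a gzipped GTF,
# skipping the '#' header lines.
gtf_seqnames() {
    gzip -dc "$1" | awk -F'\t' '!/^#/ {print $1}' | sort -u
}

# Print sequence names that occur in the first GTF but not in the second.
# (Hypothetical helper; a temporary file holds the names from the second GTF.)
gtf_only_in_first() {
    tmp=$(mktemp) || return 1
    gtf_seqnames "$2" > "$tmp"
    # -x: whole-line match, -F: fixed strings, -v: keep names NOT in $tmp
    gtf_seqnames "$1" | grep -vxF -f "$tmp"
    rm -f "$tmp"
}
```

Calling gtf_only_in_first Danio_rerio.GRCz10.87.gtf.gz Danio_rerio.GRCz10.87.chr.gtf.gz should then print only the contig names (the KN... entries above), if the two files differ as described.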

7

This answer is correct (official Ensembl response).

0

So which one should be used?

I am carrying out analysis in Homo sapiens and am doing a differential expression analysis of RNA-seq data.

0

In my opinion, either one is fine for a normal DE analysis. However, if you do not want to lose information about any annotated gene, use the one without 'chr'. (And sorry for the late response.)

0

Is it on purpose that this difference is not explained in the README? (at least for Homo sapiens 102)

2
6.3 years ago
mastal511 ★ 2.1k

Use the file that names the chromosomes in the same way as they are named in your genome.fasta file, otherwise you will have problems.
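
A quick way to check the naming, as a rough sketch only: print every sequence name used in the GTF that has no record in the FASTA file. The function names here are hypothetical, and the sketch assumes uncompressed files, a tab-separated GTF with '#' header lines, and FASTA headers where the sequence name is the first word after '>'.

```shell
# Sequence names in a FASTA file: the first word after '>' on each header line.
fasta_seqnames() {
    awk '/^>/ {sub(/^>/, "", $1); print $1}' "$1" | sort -u
}

# Sequence names in a GTF file (column 1), ignoring '#' header lines.
gtf_seqnames_plain() {
    awk -F'\t' '!/^#/ {print $1}' "$1" | sort -u
}

# Print GTF sequence names that have no matching FASTA record.
# Usage: gtf_names_missing_from_fasta genome.fa genes.gtf
gtf_names_missing_from_fasta() {
    tmp=$(mktemp) || return 1
    fasta_seqnames "$1" > "$tmp"
    # -x: whole-line match, -F: fixed strings, -v: keep names NOT in the FASTA
    gtf_seqnames_plain "$2" | grep -vxF -f "$tmp"
    rm -f "$tmp"
}
```

Empty output means every sequence name in the GTF was found in the FASTA file. Any names that are printed would need to be reconciled before building the index.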

3

This answer is wrong (official Ensembl response).

0

Thank you very much