Question: Difference between GTF file with CHR and without CHR. ENSEMBL
3
15 months ago, rf (20), Cambridge, MA, wrote:

I am very new to RNA-Seq. I am trying to align my samples with STAR, and I am generating the genome index myself because I want to add the spike-in sequences to the GTF and FASTA files.

I am downloading the GTF file from here: ftp://ftp.ensembl.org/pub/release-86/gtf/mus_musculus/

There are two GTF files, one with CHR in the name and one without. Which one should I use, and how do they differ? I could not work this out just by opening the files.

Thank you,

If you are not sure either, would you please let me know which one you use for your analysis?

The one with CHR seems to have more lines and it seems to be scaffold genes. Am I missing something?

rna-seq star ensembl • 1.3k views
modified 10 months ago by Emily_Ensembl (14k) • written 15 months ago by rf (20)
3

They're the same, just one doesn't have the prefix. I often use the one without the 'chr' prefix, and when references need to be mixed with those downloaded from the UCSC genome browser, I remove the prefix manually.
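
Removing the prefix from a UCSC-style file, as mentioned above, could be sketched like this. The function name `strip_chr_prefix` and the demo records are made up for illustration, and the sketch only drops a leading `chr` from column 1. It does not rename `chrM` to `MT` or translate UCSC scaffold names, so those would still need separate handling.

```shell
# Hypothetical helper: strip a leading "chr" from column 1 of a GTF.
# Lines starting with "#" (headers/comments) are passed through unchanged.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/  { print; next }
               { sub(/^chr/, "", $1); print }' "$@"
}

# Tiny made-up example (not real annotation data).
demo_gtf=$(mktemp)
printf '#!demo-header\nchr1\tdemo\tgene\t100\t200\t.\t+\t.\tgene_id "g1";\nchrX\tdemo\tgene\t5\t50\t.\t-\t.\tgene_id "g2";\n' > "$demo_gtf"

strip_chr_prefix "$demo_gtf"
rm -f "$demo_gtf"
```

With no file argument the function reads standard input, so it can sit at the end of a `gzip -dc file.gtf.gz | strip_chr_prefix` pipeline.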

written 15 months ago by Eric Lim (370)

thank you very much

written 15 months ago by rf (20)
2
15 months ago, mastal511 (1.7k) wrote:

Use the file that names the chromosomes in the same way as they are named in your genome.fasta file, otherwise you will have problems.
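
One way to check this before building an index is to compare the sequence names in the two files. The sketch below is only an illustration: the function names and the tiny demo files are made up, and it assumes an uncompressed FASTA and GTF.

```shell
# Sequence names from FASTA headers: text after ">" up to the first whitespace.
fasta_seqnames() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$@" | sort -u
}

# Sequence names from column 1 of a GTF, skipping "#" header lines.
gtf_seqnames() {
    awk -F '\t' '!/^#/ && NF { print $1 }' "$@" | sort -u
}

# Names used in the GTF but absent from the FASTA (no output = all match).
# Usage: gtf_names_missing_from_fasta genome.fa annotation.gtf
gtf_names_missing_from_fasta() {
    fa_names=$(mktemp); gtf_names=$(mktemp)
    fasta_seqnames "$1" > "$fa_names"
    gtf_seqnames "$2" > "$gtf_names"
    comm -13 "$fa_names" "$gtf_names"
    rm -f "$fa_names" "$gtf_names"
}

# Tiny made-up example: the GTF mentions a scaffold the FASTA lacks.
demo_fa=$(mktemp); demo_gtf=$(mktemp)
printf '>1 demo chromosome\nACGT\n>MT demo\nACGT\n' > "$demo_fa"
printf '#!demo-header\n1\tdemo\tgene\t1\t4\n KN150307.1\tdemo\tgene\t1\t4\n' | sed 's/^ //' > "$demo_gtf"
gtf_names_missing_from_fasta "$demo_fa" "$demo_gtf"   # prints KN150307.1
rm -f "$demo_fa" "$demo_gtf"
```

Any name printed here is a sequence that has annotation but no matching entry in the FASTA, which is the mismatch to resolve before indexing.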

written 15 months ago by mastal511 (1.7k)
1

This answer is wrong (official Ensembl response).

written 10 months ago by Emily_Ensembl (14k)

Thank you very much

written 15 months ago by rf (20)
6
10 months ago, mt1022 (120), China, wrote:

The file without 'chr' in its name also contains annotations for genes on unplaced or unlocalized contigs, while the file with 'chr' in its name contains annotations only for assembled chromosomes. Neither file uses a 'chr' prefix in its chromosome names. See this example:

zcat Danio_rerio.GRCz10.87.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' 
KN150307.1 9
KN150451.1 7
KN150002.1 3
KN149765.1 3
KN149909.1 15
KN149998.1 13
#!genebuild-last-updated 1
KN150027.1 10
KN149917.1 7
KN150188.1 24
...
13 43069
20 48282
KN150399.1 11
KN150221.1 10
KN150670.1 32
KN149696.1 225
21 40566
...


zcat Danio_rerio.GRCz10.87.chr.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' | head
#!genebuild-last-updated 1
MT 147
#!genome-date 1
10 40589
11 40971
12 39371
13 43069
20 48282
21 40566
14 35910
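
The commands above also tally the `#!` header lines (for example `#!genebuild-last-updated 1`) as if they were sequence names. A small variant that skips them might look like the sketch below. The function name `count_per_seqname` is made up here, and `gzip -dc` is used in the usage comments as a portable stand-in for `zcat`.

```shell
# Count GTF records per sequence name, ignoring "#" header lines and blanks.
# Output is sorted by name so two files are easy to compare side by side.
count_per_seqname() {
    awk -F '\t' '!/^#/ && NF { n[$1]++ }
                 END { for (s in n) print s, n[s] }' "$@" | sort
}

# Usage with the files from the example above (not run here):
#   gzip -dc Danio_rerio.GRCz10.87.gtf.gz     | count_per_seqname
#   gzip -dc Danio_rerio.GRCz10.87.chr.gtf.gz | count_per_seqname

# Tiny made-up input to show the behaviour without downloading anything.
printf '#!demo-header\n13\tdemo\tgene\n13\tdemo\texon\nKN150307.1\tdemo\tgene\n' | count_per_seqname
```
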
modified 10 months ago • written 10 months ago by mt1022 (120)
4

This answer is correct (official Ensembl response).

written 10 months ago by Emily_Ensembl (14k)