Question: Difference between GTF file with CHR and without CHR. ENSEMBL
4
2.1 years ago by
rf30
Cambridge, MA
rf30 wrote:

I am very new to RNA-Seq. I am trying to align my samples with STAR, and I am generating the genome index myself because I want to add the spike-in sequences to the GTF and FASTA files.

I am downloading the GTF file from here: ftp://ftp.ensembl.org/pub/release-86/gtf/mus_musculus/

There are two GTF files, one with CHR in the file name and one without. Which one should I use, and how do they differ? I have not been able to figure this out just by opening the files.

Thank you,

If you are not sure either, would you please let me know which one you use for your analysis?

The one with CHR seems to have more lines and it seems to be scaffold genes. Am I missing something?

rna-seq star ensembl • 3.1k views
modified 21 months ago by Emily_Ensembl16k • written 2.1 years ago by rf30
3

They're the same, just one doesn't have the prefix. I often use the one without the 'chr' prefix, and when references need to be mixed with ones downloaded from the UCSC genome browser, I remove the prefix manually.
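
For what it's worth, removing the prefix can be done with a one-line filter. This is only a rough sketch: `strip_chr_prefix` is a made-up helper name, and it only handles the plain 'chr' prefix at the start of a line. Names that differ in other ways (for example UCSC's chrM versus Ensembl's MT) would still need to be mapped separately.

```shell
# Hypothetical helper: strip a leading "chr" from every line read on stdin.
# Works on tab-delimited files whose first column is the sequence name
# (GTF, BED); lines that do not start with "chr" pass through unchanged.
strip_chr_prefix() {
    sed 's/^chr//'
}

# Example usage (file names are placeholders):
# strip_chr_prefix < ucsc_annotation.gtf > annotation.nochr.gtf
```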

written 2.1 years ago by Eric Lim1.1k

thank you very much

written 2.1 years ago by rf30
2
2.1 years ago by
mastal5112.0k
mastal5112.0k wrote:

Use the file that names the chromosomes in the same way as they are named in your genome.fasta file, otherwise you will have problems.
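
A rough way to check this, assuming an uncompressed FASTA and a gzipped GTF (`missing_seqnames` and the file names in the usage line are placeholders, not part of any tool): list every sequence name that appears in the GTF but not in the FASTA headers. No output means every name in the GTF was found in the FASTA.

```shell
# Hypothetical helper: print GTF sequence names that are absent from a FASTA.
#   $1 = genome FASTA (plain text), $2 = annotation GTF (gzip-compressed)
missing_seqnames() {
    gzip -dc "$2" | awk -F'\t' '
        # First input (the FASTA): remember each header name, i.e. the text
        # after ">" up to the first space or tab.
        NR == FNR {
            if (/^>/) { sub(/^>/, ""); sub(/[ \t].*/, ""); seen[$0] = 1 }
            next
        }
        # Second input (the GTF on stdin): skip "#" header lines and report
        # each unknown sequence name once.
        !/^#/ && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$1" -
}

# Example usage (placeholder file names):
# missing_seqnames genome.fa annotation.gtf.gz
```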

written 2.1 years ago by mastal5112.0k
1

This answer is wrong (official Ensembl response).

written 21 months ago by Emily_Ensembl16k

Thank you very much

written 2.1 years ago by rf30
12
21 months ago by
mt1022180
China
mt1022180 wrote:

The file without 'chr' in its name also contains annotations for genes on unplaced or unlocalized contigs, while the file with 'chr' in its name only contains annotations for the assembled chromosomes. Neither file uses a 'chr' prefix in the chromosome names. See this example:

zcat Danio_rerio.GRCz10.87.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' 
KN150307.1 9
KN150451.1 7
KN150002.1 3
KN149765.1 3
KN149909.1 15
KN149998.1 13
#!genebuild-last-updated 1
KN150027.1 10
KN149917.1 7
KN150188.1 24
...
13 43069
20 48282
KN150399.1 11
KN150221.1 10
KN150670.1 32
KN149696.1 225
21 40566
...


zcat Danio_rerio.GRCz10.87.chr.gtf.gz | cut -f1 | awk '{dict[$1]++}END{for(i in dict) print i, dict[i]}' | head
#!genebuild-last-updated 1
MT 147
#!genome-date 1
10 40589
11 40971
12 39371
13 43069
20 48282
21 40566
14 35910
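
Note that the counts above also include the `#!` header lines (for example `#!genebuild-last-updated 1`), because `cut -f1` treats each of them as a first field. A small variant that skips those lines and sorts by record count might look like this; `count_seqnames` is just an illustrative name, and `gzip -dc` is used in place of `zcat` only for portability:

```shell
# Hypothetical helper: count GTF records per sequence name, ignoring the
# "#" header lines, and list the most frequent names first.
#   $1 = gzip-compressed GTF file
count_seqnames() {
    gzip -dc "$1" |
        awk -F'\t' '!/^#/ { n[$1]++ } END { for (s in n) print s, n[s] }' |
        sort -k2,2nr
}

# Example usage with the file from this post:
# count_seqnames Danio_rerio.GRCz10.87.chr.gtf.gz | head
```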
modified 21 months ago • written 21 months ago by mt1022180
4

This answer is correct (official Ensembl response).

written 21 months ago by Emily_Ensembl16k

So which one should be used?

I am carrying out analysis in Homo sapiens and am doing a differential expression analysis of RNA-seq data.

modified 10 months ago • written 10 months ago by amina.mcdiarmid0

In my opinion, using either one is fine for a normal DE analysis. However, if you do not want to lose information about any annotated gene, use the one without 'chr'. (And sorry for the late response.)

modified 9 months ago • written 9 months ago by mt1022180