Question: UCSC hg19 gtf (genePredToGtf OS incompatibility)
3.3 years ago, umn_bist wrote:

I found that the annotated GTF file for hg19 from the UCSC Table Browser does not adhere to the standard GTF format. As a result, I've been getting a fatal error in STAR:

Fatal INPUT FILE error, no valid exon lines in the GTF file: /work/cellbiology/s167125/Documents/ucsc_hg19/ucsc.hg19.gtf
Solution: check the formatting of the GTF file. Most likely cause is the difference in chromosome naming between GTF and FASTA file.
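
As a quick check of the cause suggested in the error message, the sequence names in the two files can be compared directly. This is only a minimal sketch: the function names and the `report` helper are made up for illustration, and it assumes a tab-separated GTF with the feature type in column 3. It does not reproduce STAR's own validation.

```python
def fasta_names(lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                # The name is the first whitespace-delimited token after '>'.
                names.add(tokens[0])
    return names


def gtf_names(lines):
    """Collect the values of column 1 of a GTF, skipping comments and blanks."""
    names = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def usable_exon_lines(gtf_lines, genome_names):
    """Count 'exon' lines whose sequence name also occurs in the FASTA."""
    count = 0
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "exon" and fields[0] in genome_names:
            count += 1
    return count


def report(fasta_path, gtf_path):
    """Print a short comparison of the two files (hypothetical helper)."""
    with open(fasta_path) as fasta:
        genome = fasta_names(fasta)
    with open(gtf_path) as gtf:
        annotation = gtf_names(gtf)
    with open(gtf_path) as gtf:
        exons = usable_exon_lines(gtf, genome)
    print("names only in GTF:", sorted(annotation - genome)[:10])
    print("names only in FASTA:", sorted(genome - annotation)[:10])
    print("exon lines on sequences present in the FASTA:", exons)


# Example call, with placeholder paths:
# report("genome.fa", "annotation.gtf")
```

If the last count is zero even though the names match, the problem is more likely the GTF formatting itself (for example, no lines with `exon` in column 3) than the chromosome naming.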

I know that I can get a well-formed GTF file with the genePredToGtf utility, but it is only compatible with 64-bit Linux, and I only have access to a Mac. Is there an alternative way to get a GTF for UCSC's hg19 reference genome?

Thank you for the help

rna-seq genome • 3.9k views

Is there a reason you want to use the UCSC annotation? The one from Ensembl/Gencode is almost always better (there's a reason that UCSC now uses the copy from gencode).

written 3.3 years ago by Devon Ryan

Yes, I checked the headers of my reference genome (ucsc_hg19.fa) and of its annotation GTF file (ucsc_hg19.gtf), and both use the 'chr' notation.

Digging further, I realized UCSC does not keep a GTF file of its gene structures; they are all in genePred format.

written 3.3 years ago by umn_bist

You can export the UCSC gene predictions in GTF from the table browser.

written 3.3 years ago by Vivek

That is what I thought as well, but see this wiki page:

UCSC does not keep gene structures in GTF format, we use a single line format for a single gene with all the information about that gene in the single line: GenePred format.

Extracting GTF format files from the genePred format can be performed with the genePredToGtf: kent command utility.

At this time, this genePredToGtf command can provide better GTF files than available from the table browser.
modified 3.3 years ago • written 3.3 years ago by umn_bist
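
To make the single-line genePred layout mentioned in the quote concrete, here is a rough sketch that expands one genePred record into GTF exon lines. It assumes the plain 10-column genePred layout (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds) with no leading bin column, and it reuses the transcript name as `gene_id` because that layout carries no gene name. The function name and the sample record are invented for illustration. It is not a substitute for genePredToGtf, which also writes CDS and codon features.

```python
def genepred_to_gtf_exons(line, source="genePred"):
    """Expand one 10-column genePred line into a list of GTF exon lines."""
    fields = line.rstrip("\n").split("\t")
    name, chrom, strand = fields[0], fields[1], fields[2]
    # exonStarts and exonEnds are comma-separated lists with a trailing comma.
    starts = [int(x) for x in fields[8].split(",") if x]
    ends = [int(x) for x in fields[9].split(",") if x]
    # Assumption: no gene name is available, so the transcript name is used twice.
    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    rows = []
    for start, end in zip(starts, ends):
        # genePred starts are 0-based and ends are exclusive;
        # GTF is 1-based and inclusive, so only the start shifts by one.
        rows.append("\t".join(
            [chrom, source, "exon", str(start + 1), str(end), ".", strand, ".", attributes]
        ))
    return rows


# Made-up record with two exons, for illustration only.
example = "tx1\tchr1\t+\t100\t500\t100\t100\t2\t100,300,\t200,500,\n"
for row in genepred_to_gtf_exons(example):
    print(row)
```

Table dumps that carry an extra leading bin column would need that column dropped before the line is passed to this function.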

To be honest, no. It's just something I had on hand, and I had already generated the STAR index with it. I found that Alex Dobin, the author of STAR, recommends using GENCODE.

modified 3.3 years ago • written 3.3 years ago by umn_bist

Yup, Gencode/Ensembl (they're more or less identical) are what you'll find most people (myself included) recommending.

written 3.3 years ago by Devon Ryan

@Devon Ryan, could you say a bit more about why the Ensembl annotation is better than UCSC's? Thanks!

modified 2.8 years ago • written 2.8 years ago by epigene

It's more likely to represent the transcripts you see in your experiments.

written 2.8 years ago by Devon Ryan

@Devon Ryan, because Ensembl people curate the annotation better?

written 2.8 years ago by epigene

Ensembl and UCSC use completely different methods to arrive at their annotations (historically, at least; for recent mouse and human annotations they should be the same).

written 2.8 years ago by Devon Ryan

The error message says the most likely issue is the chromosome naming convention, so the fix could be as simple as adding or removing the "chr" prefix in the GTF or the reference FASTA.

written 3.3 years ago by Vivek
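
The prefix change suggested above could look roughly like this. It is a sketch, not a complete converter: the function names and the `rewrite_gtf` helper are invented for illustration, and only column 1 of each GTF line is touched. Note that the mitochondrial sequence is usually named chrM in UCSC files and MT in Ensembl files, which a plain prefix change does not reconcile.

```python
def strip_chr(line):
    """Remove a leading 'chr' from column 1 of a GTF line."""
    if line.startswith("#") or not line.strip():
        return line  # leave comments and blank lines untouched
    chrom, sep, rest = line.partition("\t")
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return chrom + sep + rest


def add_chr(line):
    """Add a leading 'chr' to column 1 of a GTF line if it is missing."""
    if line.startswith("#") or not line.strip():
        return line
    chrom, sep, rest = line.partition("\t")
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return chrom + sep + rest


def rewrite_gtf(in_path, out_path, convert):
    """Apply `convert` (strip_chr or add_chr) to every line of a GTF file."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(convert(line))


# Example call, with placeholder paths:
# rewrite_gtf("annotation.gtf", "annotation.chr.gtf", add_chr)
```

Writing to a new file rather than editing in place keeps the original annotation available for comparison.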