Question: GTF file for HTSeq
19 months ago by
KVC_bioinfo380 wrote:


I have a question regarding the GTF file to be used with HTSeq. Can we use a GTF downloaded from the UCSC Table Browser? I downloaded my GTF from the UCSC Genome Browser. I am using NCBI's RefSeq (human transcriptome) as a reference. For this reference, what is the best way to get a GTF file for HTSeq?

Thank you in advance.

#htseq rna-seq #gtf • 1.1k views
19 months ago by
h.mon25k wrote:

Yes, you can, but the results won't be good - from the FAQ:

I have used a GTF file generated by the Table Browser function of the UCSC Genome Browser, and most reads are counted as ambiguous. Why?

In these files, the gene_id attribute incorrectly contains the same value as the transcript_id attribute and hence a different value for each transcript of the same gene. Hence, if a read maps to an exon shared by several transcripts of the same gene, this will appear to htseq-count as an overlap with several genes. Therefore, these GTF files cannot be used as is. Either correct the incorrect gene_id attributes with a suitable script, or use a GTF file from a different source.
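
To check whether a particular GTF is affected, something like the following might help. It is only a sketch: `count_same_ids` is a hypothetical helper name, and it assumes the ninth column stores attributes in the usual `gene_id "X"; transcript_id "Y";` form.

```shell
# Hypothetical helper: report how many GTF records carry identical
# gene_id and transcript_id values.
# Assumption: attributes are written as  gene_id "X"; transcript_id "Y";
# Usage: count_same_ids my_ucsc_download.gtf
count_same_ids() {
    awk -F'\t' '
        /^#/ { next }                       # skip comment/header lines
        {
            gene = ""; tx = ""
            # pull the quoted value out of each attribute
            if (match($9, /gene_id "[^"]*"/))
                gene = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]*"/))
                tx = substr($9, RSTART + 15, RLENGTH - 16)
            total++
            if (gene != "" && gene == tx) same++
        }
        END { printf "%d of %d records have gene_id == transcript_id\n", same, total }
    ' "$1"
}
```

If nearly every record is reported as identical, the file most likely has the problem described in the FAQ.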

edit: Try the following solution, taken from the excellent Introduction to differential gene expression analysis using RNA-seq:

# First, download a table for "Genes and Gene Predictions" from the UCSC Table
# Browser, indicating as the output format: "all fields from selected table".
# (NOTE: this may not work for all GTF files downloaded from UCSC! genePredToGtf
# is very finicky and every organism's annotation may have been generated and
# deposited by a different person.)

head -n1 allfields_hg19.txt

# remove first column and first line, feed that into genePredToGtf

cut -f 2- allfields_hg19.txt | sed '1d' | \
genePredToGtf file stdin hg19_RefSeq.gtf

head -n1 hg19_RefSeq.gtf

Yes. But how can we make changes to the GTF file we have from the Table Browser?

written 19 months ago by KVC_bioinfo380
  1. In hgTables, choose NCBI RefSeq as the track.
  2. As the output format, choose "selected fields from primary and related tables" and click "get output". This takes you to a page where you can choose the fields to output.
  3. Select the name and name2 fields and click "get output". This outputs the transcript name and the gene name.
  4. Now replace the gene_id (in the UCSC GTF) with the gene symbol from the freshly exported list.
  5. A script might be available in one of the posts on Biostars, or ask the awk experts here :)
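
The replacement step could be sketched with awk roughly as follows. This is not a tested, official script: `fix_gene_ids` and the file names in the usage comment are hypothetical, and it assumes the exported list is tab-separated with the transcript name in column 1 and the gene symbol (name2) in column 2, with any header line starting with `#`.

```shell
# Hypothetical helper: rewrite gene_id in a UCSC GTF using a
# transcript-name -> gene-symbol table exported from the Table Browser.
# Assumptions: the map is tab-separated (name, name2) and header lines
# start with "#"; GTF attributes look like  gene_id "X"; transcript_id "Y";
# Usage: fix_gene_ids name_name2.tsv ucsc.gtf > ucsc_fixed.gtf
fix_gene_ids() {
    awk -F'\t' -v OFS='\t' '
        # first file: load the transcript -> gene symbol map
        NR == FNR { if ($0 !~ /^#/) symbol[$1] = $2; next }
        # second file: pass GTF comment lines through unchanged
        /^#/ { print; next }
        {
            if (match($9, /transcript_id "[^"]*"/)) {
                tx = substr($9, RSTART + 15, RLENGTH - 16)
                # only rewrite gene_id when the transcript is in the map
                if (tx in symbol)
                    sub(/gene_id "[^"]*"/, "gene_id \"" symbol[tx] "\"", $9)
            }
            print
        }
    ' "$1" "$2"
}
```

Records whose transcript_id is not found in the exported list are left unchanged, so it is worth checking the output for gene_id values that were not replaced before running htseq-count.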
modified 19 months ago • written 19 months ago by cpad011211k