GTF file for HTSeq
8.0 years ago
KVC_bioinfo ▴ 610

Hello,

I have a question about the GTF file to use with HTSeq. Can we use a GTF downloaded from the UCSC Table Browser?

I downloaded the GTF from the UCSC Genome Browser. I am using NCBI's RefSeq (human transcriptome) as the reference. For this reference, what is the best way to get a GTF file for HTSeq?

Thank you in advance

htseq RNA-Seq GTF • 3.8k views
8.0 years ago
h.mon 35k

Yes, you can, but the results won't be good. From the HTSeq FAQ:

I have used a GTF file generated by the Table Browser function of the UCSC Genome Browser, and most reads are counted as ambiguous. Why?

In these files, the gene_id attribute incorrectly contains the same value as the transcript_id attribute and hence a different value for each transcript of the same gene. Hence, if a read maps to an exon shared by several transcripts of the same gene, this will appear to htseq-count as an overlap with several genes. Therefore, these GTF files cannot be used as is. Either correct the incorrect gene_id attributes with a suitable script, or use a GTF file from a different source.
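
To see whether a particular GTF has this problem, something like the following could be used. It is only a rough sketch: `count_same_ids` is a made-up helper name (not part of HTSeq or the UCSC tools), and it assumes the usual attribute layout `gene_id "X"; transcript_id "Y";` in column 9.

```shell
# Report how many exon lines carry a gene_id identical to their transcript_id.
# count_same_ids is a hypothetical helper name used for illustration only.
count_same_ids() {
    awk -F'\t' '
        $3 == "exon" {
            gid = ""; tid = ""
            # "gene_id \"" is 9 characters, "transcript_id \"" is 15
            if (match($9, /gene_id "[^"]*"/))
                gid = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]*"/))
                tid = substr($9, RSTART + 15, RLENGTH - 16)
            total++
            if (gid != "" && gid == tid) same++
        }
        END { printf "%d of %d exon lines have gene_id == transcript_id\n", same, total }
    ' "$1"
}
```

Calling `count_same_ids your_file.gtf` (file name is a placeholder) prints one summary line. If nearly every exon line is flagged, the file most likely has the gene_id problem described in the FAQ.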

edit: Try the following solution, taken from the excellent Introduction to differential gene expression analysis using RNA-seq:

# First, download a table for "Genes and Gene Predictions" from the UCSC Table
# Browser, choosing "all fields from selected table" as the output format.
# NOTE: this may not work for all GTF files downloaded from UCSC! genePredToGtf
# is very finicky, and every organism's annotation may have been generated and
# deposited by a different person.

# Look at the header line to check the column layout
head -n1 allfields_hg19.txt

# Remove the first column and the first (header) line, then feed the rest into genePredToGtf

cut -f 2- allfields_hg19.txt | sed '1d' | \
genePredToGtf file stdin hg19_RefSeq.gtf

# Check the first line of the resulting GTF
head -n1 hg19_RefSeq.gtf

Yes. But how can we make changes to the GTF file we have from the Table Browser?

  1. In hgTables (the UCSC Table Browser), choose NCBI RefSeq as the track.
  2. As the output format, choose "selected fields from primary and related tables" and click "get output". This takes you to a page where you can choose the fields to export.
  3. Select the name and name2 fields and click "get output". This outputs the transcript name and gene name for each transcript.
  4. Replace the gene_id values in the UCSC GTF with the gene symbols from the freshly exported list.
  5. A script for this might be available in one of the posts on Biostars, or ask the awk experts here :)
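
For what it's worth, the replacement step could be sketched with awk roughly as follows. This is an untested-on-real-data sketch, not an established script: `fix_gene_ids` and the file names are made up, the mapping file is assumed to be the two-column, tab-separated name/name2 export (with an optional header line starting with `#`), and the GTF is assumed to use the usual `gene_id "X"; transcript_id "Y";` attribute layout.

```shell
# fix_gene_ids MAP GTF   (hypothetical helper name)
#   MAP: tab-separated transcript name (name) and gene symbol (name2)
#   GTF: Table Browser GTF in which gene_id equals transcript_id
# Writes the corrected GTF to stdout.
fix_gene_ids() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                       # first file: transcript -> gene symbol
            if ($0 !~ /^#/ && NF >= 2) sym[$1] = $2
            next
        }
        /^#/ { print; next }              # keep GTF comment lines as they are
        {
            if (match($9, /transcript_id "[^"]*"/)) {
                tid = substr($9, RSTART + 15, RLENGTH - 16)
                # Assumes symbols contain no "&" or backslash, which sub() treats specially
                if (tid in sym)
                    sub(/gene_id "[^"]*"/, "gene_id \"" sym[tid] "\"", $9)
            }
            print
        }
    ' "$1" "$2"
}
```

It would be used as `fix_gene_ids name_name2.txt ucsc.gtf > ucsc_fixed.gtf` (all three file names are placeholders). Transcripts that are missing from the mapping keep their original gene_id, so it is worth checking the output for leftovers before running htseq-count.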