I have a question regarding the GTF file to be used with HTSeq. Can we use the GTF downloaded from the UCSC Table Browser?
I downloaded the GTF from the UCSC Genome Browser. I am using NCBI's RefSeq (human transcriptome) as a reference. For this reference, what is the best way to get the GTF file for HTSeq?
Yes, you can, but the results won't be good - from the FAQ:
I have used a GTF file generated by the Table Browser function of the
UCSC Genome Browser, and most reads are counted as ambiguous. Why?
In these files, the gene_id attribute incorrectly contains the same value as the transcript_id attribute and hence a different value
for each transcript of the same gene. Hence, if a read maps to an exon
shared by several transcripts of the same gene, this will appear to
htseq-count as an overlap with several genes. Therefore, these GTF
files cannot be used as is. Either correct the incorrect gene_id
attributes with a suitable script, or use a GTF file from a different
source.
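As a rough illustration of the "suitable script" the FAQ mentions, here is a minimal sketch that overwrites gene_id using a transcript-to-gene lookup. The function name fix_gene_id and the two-column, tab-separated map file are my own assumptions and not part of HTSeq or the UCSC tools. Records whose transcript_id is not in the map are left unchanged.

```shell
# Hypothetical helper (not part of HTSeq or UCSC): rewrite gene_id in a GTF.
# $1 = tab-separated map file: transcript_id <TAB> gene name (assumed layout)
# $2 = input GTF; the corrected GTF is written to stdout.
fix_gene_id() {
    awk -F'\t' -v OFS='\t' '
        # First file: remember the gene name for each transcript ID.
        NR == FNR { gene[$1] = $2; next }
        # Comment lines in the GTF pass through unchanged.
        /^#/ { print; next }
        {
            # Extract the transcript_id value from the attribute column (9).
            if (match($9, /transcript_id "[^"]+"/)) {
                tid = substr($9, RSTART + 15, RLENGTH - 16)
                # Only rewrite records whose transcript is in the map.
                if (tid in gene)
                    sub(/gene_id "[^"]*"/, "gene_id \"" gene[tid] "\"", $9)
            }
            print
        }
    ' "$1" "$2"
}
```

For example, with a refGene download, where I am assuming the transcript accession sits in column 2 (name) and the gene symbol in column 13 (name2), the map could be built with cut -f 2,13 allfields_hg19.txt | sed '1d' > tx2gene.tsv and applied with fix_gene_id tx2gene.tsv table_browser.gtf > fixed.gtf (both file names are placeholders). Check the header line first, since column positions differ between tables.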
# To build a GTF from the UCSC table with genePredToGtf:
# First, download a table from "Genes and Gene Predictions" in the UCSC Table
# Browser, choosing "all fields from selected table" as the output format.
# NOTE: this may not work for all annotation tables downloaded from UCSC!
# genePredToGtf is very finicky, and every organism's annotation may have been
# generated and deposited by a different person.
head -n1 allfields_hg19.txt
# remove the first column and the first (header) line, then feed the rest into genePredToGtf
cut -f 2- allfields_hg19.txt | sed '1d' | \
genePredToGtf file stdin hg19_RefSeq.gtf
head -n1 hg19_RefSeq.gtf
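To check whether a GTF still has the problem described in the FAQ, one option is to count the records where gene_id and transcript_id carry the same value. This is only a sketch: count_same_ids is a made-up helper name, and it assumes the usual GTF layout with attributes in column 9.

```shell
# Hypothetical helper: count GTF records whose gene_id equals the transcript_id.
# $1 = GTF file; prints a single number.
count_same_ids() {
    awk -F'\t' '
        # Skip comment lines.
        /^#/ { next }
        {
            g = ""; t = ""
            # Pull both attribute values out of column 9.
            if (match($9, /gene_id "[^"]*"/))
                g = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]*"/))
                t = substr($9, RSTART + 15, RLENGTH - 16)
            if (g != "" && g == t) n++
        }
        END { print n + 0 }
    ' "$1"
}
```

Running count_same_ids hg19_RefSeq.gtf should print 0, or something close to it, if the conversion picked up proper gene names. A count near the total number of records suggests the file would still make htseq-count report most reads as ambiguous.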
Yes. But how can we make changes to the GTF file we already have from the Table Browser?