htseq-count needs a .gtf file, and the tutorial says we cannot use one from UCSC. Does anyone know a source for an hg19 GTF?
EDIT: Post title edited by Ashutosh
Which tutorial? By the way, it's best to use the Ensembl GTF when running htseq-count.
I meant the HTSeq 0.6.1p2 documentation,
in the answer to one of the common questions.
Thanks a lot, I got one and it works for me.
I have moved the comment to answer.
You can also use the GTF from Gencode (I am using it without any problem). And by the way, GTF files from any repository should work with HTSeq.
It is true that the Gencode GTF works fine with htseq-count; I have used it as well. But I'd be cautious before saying that GTFs from other sources (especially UCSC) work as well as those from Gencode and Ensembl. I have observed that some programs, like the Python scripts in DEXSeq and even some Cufflinks programs such as cuffcompare, work really well with Ensembl but not with Gencode.
Can you please post the errors you get with the Gencode GTF, and also mention the Gencode version? That would help others recognize the problem and fix it.
Santhilal Subhash Sure. Sometime soon.
Alright, so I found my own post, "C: Warning in CuffDiff when using novel transcripts from CuffMerge as input", from a couple of months(?) back. My pipeline got stuck at the differential expression stage in the cuffdiff program, and I couldn't figure out what was wrong until I changed my GTF to Ensembl and things started chugging along.
Quoted from DEXSeq Manual Section 2.4:
"We have tested our tools chiefly with GTF files from Ensembl and hence recommend to prefer these, as files from other providers sometimes do not adhere fully to the GTF standard and cause the preprocessing to fail."
komal.rathi and Santhilal Subhash
I just thought I'd share this:
htseq-count reports only one hit per aligned read. If a read aligns to two different transcripts of the same gene, it is counted once for the gene they belong to.
Whatever GTF you use, it needs to indicate which transcripts belong to the same gene, e.g. exon lines from two transcripts of the same gene should have the same gene_id but different transcript_id values.
I know that we cannot use the UCSC Table Browser GTF because it has the same value for gene_id and transcript_id, so htseq-count treats every transcript as its own gene and loses the reads that overlap more than one of them.
All we need to look for in our GTF is that gene_id and transcript_id are different; then htseq-count works best.
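To make that check concrete, here is a minimal Python sketch that groups transcript_id values by gene_id on the exon lines of a GTF and flags the UCSC Table Browser pattern described above. The function names and the example records (GENE_A, TX_A1, TX_A2, the "demo" source) are made up for illustration, and the attribute parsing assumes the usual key "value"; layout in column 9. It is a rough sanity check, not part of HTSeq.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "GENE_A";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcripts_per_gene(gtf_lines, feature="exon"):
    """Map each gene_id to the set of transcript_ids seen on `feature` lines."""
    genes = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature:
            continue  # not a complete 9-column record, or wrong feature type
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" in attrs and "transcript_id" in attrs:
            genes[attrs["gene_id"]].add(attrs["transcript_id"])
    return genes


def looks_transcript_level(genes):
    """True if every gene_id simply repeats its single transcript_id."""
    return bool(genes) and all(tx == {gene} for gene, tx in genes.items())


# Hypothetical records: two transcripts that share one gene_id (what we want).
GOOD = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A1";',
    'chr1\tdemo\texon\t150\t300\t.\t+\t.\tgene_id "GENE_A"; transcript_id "TX_A2";',
]
# Hypothetical records: gene_id copied from transcript_id (the problem case).
BAD = [
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "TX_A1"; transcript_id "TX_A1";',
    'chr1\tdemo\texon\t150\t300\t.\t+\t.\tgene_id "TX_A2"; transcript_id "TX_A2";',
]

print(looks_transcript_level(transcripts_per_gene(GOOD)))  # False
print(looks_transcript_level(transcripts_per_gene(BAD)))   # True
```

For a real file you would pass an open file handle instead of a list, e.g. transcripts_per_gene(open("genes.gtf")), where genes.gtf is a placeholder name. If the check returns True, the GTF probably has no real gene-level grouping and is a poor fit for htseq-count.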
I am facing the same problem with HTSeq. I downloaded the GTF from the UCSC Genome Browser, and I am using NCBI's RefSeq (human transcriptome) as a reference. What is the best way to get a GTF file for HTSeq for this reference?
Thank you in advance.