Question: GTF file for reference
KVC_bioinfo (Boston) wrote 2.2 years ago:

Hello,

I have downloaded the reference for aligning RNA-Seq reads to the human transcriptome from this link. I downloaded the RefSeq transcripts from that link to use as a reference, but I was not sure how to get a GTF file for this reference. I posted that question on Biostars a few days ago and was told to download it from the UCSC Table Browser, so I downloaded it from there.

However, the GTF from the Table Browser has the same value for gene_id and transcript_id, which is not suitable for analysis with HTSeq. So I have a couple of questions:

  1. What should I do in this case? I do not feel safe editing the GTF file.
  2. Is there any other way to get a GTF for this specific reference that is compatible with HTSeq?
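
One way to confirm the symptom described above is to measure how many lines carry identical gene_id and transcript_id values. The following is a minimal sketch; the helper names `ids_from_line` and `fraction_identical` are hypothetical, and the two sample lists are made-up illustrations (not real Table Browser or GENCODE output) that assume tab-separated columns with the attributes in the ninth column.

```python
def ids_from_line(line):
    """Return (gene_id, transcript_id) for one GTF line, or None for comments/blank lines."""
    if line.startswith("#") or not line.strip():
        return None
    attr_field = line.rstrip("\n").split("\t")[8]
    ids = {}
    for item in attr_field.split(";"):
        key, _, value = item.strip().partition(" ")
        if key in ("gene_id", "transcript_id"):
            ids[key] = value.strip().strip('"')
    return ids.get("gene_id"), ids.get("transcript_id")


def fraction_identical(lines):
    """Fraction of lines (having both attributes) where gene_id equals transcript_id."""
    checked = identical = 0
    for line in lines:
        pair = ids_from_line(line)
        if pair is None or None in pair:
            continue
        checked += 1
        identical += pair[0] == pair[1]
    return identical / checked if checked else 0.0


# Illustrative lines only: coordinates and source names are invented for the example.
same_ids = [
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "NR_046018"; transcript_id "NR_046018";',
]
distinct_ids = [
    'chr1\texample\texon\t100\t200\t.\t+\t.\tgene_id "DDX11L1"; transcript_id "NR_046018";',
]

print(fraction_identical(same_ids))      # 1.0
print(fraction_identical(distinct_ids))  # 0.0
```

To check a real file, the same function can be fed an open file handle, for example `fraction_identical(open("my_annotation.gtf"))`, where the filename is a placeholder.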
rna-seq refseq gtf • 1.7k views
Kevin Blighe wrote 2.2 years ago:

I would highly recommend the GENCODE GTF, whose information fields contain the gene symbols that you want. I am almost certain that it is compatible with HTSeq.

See here: http://www.gencodegenes.org/releases/current.html

[be sure to download the correct GTF for your genome build (GRCh37/hg19 or GRCh38/hg38)]


Thank you. I am using GRCh38 and followed the link you provided. Can I use "Comprehensive gene annotation", the very first file on that page, when the reference is the human transcriptome (NCBI's RefSeq transcripts)?

(written 2.2 years ago by KVC_bioinfo)

Yes, precisely.

Here is the direct link: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

Here is the first record (DDX11L1 is 'always' the first gene, right at the beginning of the short arm of chr1):

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
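
To see how the gene symbol can be pulled out of such a record, here is a minimal sketch that splits the attribute column into key/value pairs. `parse_attributes` is a hypothetical helper, not part of HTSeq; the record is the one quoted above, rewritten with tab separators on the assumption that the real file is tab-delimited, and the parser assumes attribute values contain no semicolons.

```python
# The record quoted above; real GTF files separate the nine columns with tabs.
record = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; '
    'gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'
)


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict (assumes no ';' inside values)."""
    attrs = {}
    for item in attr_field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


columns = record.split("\t")
attrs = parse_attributes(columns[8])

# Feature type, coordinates, stable gene ID, and gene symbol.
print(columns[2], columns[3], columns[4], attrs["gene_id"], attrs["gene_name"])
```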

(written 2.2 years ago by Kevin Blighe)

Thank you very much. I was under the wrong impression that the GTF files for the human genome and the human transcriptome are different.

(written 2.2 years ago by KVC_bioinfo)

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

(written 2.2 years ago by WouterDeCoster)

Thanks for the information!

(written 2.2 years ago by KVC_bioinfo)

genecats.ucsc wrote 2.2 years ago:

If you would like to "edit" the GTF file you obtained from the UCSC Table Browser, we provide some utilities for doing so: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

The basic gist is to download your table of interest, chop off some columns (may or may not be necessary depending on the specific table), then run the genePredToGtf utility:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg19 | \
cut -f2- | genePredToGtf -source=hg19.refGene.ucsc file stdin stdout

To write an hg19 refGene GTF file, change stdout in the last command to the output filename you want. The output looks like this:

chr1    hg19.refGene.ucsc   transcript  11869   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357";  gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    11869   12227   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "1"; exon_id "NR_148357.1"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    12613   12721   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "2"; exon_id "NR_148357.2"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    13221   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "3"; exon_id "NR_148357.3"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
...
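
With its default settings, htseq-count considers features of type `exon` and groups them by the `gene_id` attribute, so a quick sanity check on a GTF like the one above is to count exon lines per gene_id. The sketch below does that on the example lines; `gtf_line` and `exons_per_gene` are hypothetical helpers written for this illustration, and the lines are rebuilt with tab separators on the assumption that the real output is tab-delimited.

```python
from collections import defaultdict


def gtf_line(feature, start, end, attributes):
    """Assemble one tab-separated GTF line for the chr1 '+' strand example."""
    return "\t".join(
        ["chr1", "hg19.refGene.ucsc", feature, str(start), str(end), ".", "+", ".", attributes]
    )


# Lines copied from the example output above.
example = [
    gtf_line("transcript", 11869, 14362,
             'gene_id "LOC102725121"; transcript_id "NR_148357";  gene_name "LOC102725121";'),
    gtf_line("exon", 11869, 12227,
             'gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "1"; '
             'exon_id "NR_148357.1"; gene_name "LOC102725121";'),
    gtf_line("exon", 12613, 12721,
             'gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "2"; '
             'exon_id "NR_148357.2"; gene_name "LOC102725121";'),
    gtf_line("exon", 13221, 14362,
             'gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "3"; '
             'exon_id "NR_148357.3"; gene_name "LOC102725121";'),
    gtf_line("transcript", 11874, 14409,
             'gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";'),
]


def exons_per_gene(lines, feature_type="exon", id_attribute="gene_id"):
    """Count lines of the given feature type per value of the given attribute."""
    counts = defaultdict(int)
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        for item in cols[8].split(";"):
            key, _, value = item.strip().partition(" ")
            if key == id_attribute:
                counts[value.strip().strip('"')] += 1
    return dict(counts)


print(exons_per_gene(example))  # {'LOC102725121': 3}
```

Only the three exon lines are counted; the two transcript lines are skipped because their feature type is not `exon`.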

If you have further questions about UCSC data or tools, feel free to send your question to one of the mailing lists below:

  • General questions: genome@soe.ucsc.edu
  • Questions involving private data: genome-www@soe.ucsc.edu
  • Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser
