Question: GTF file for reference
19 months ago, KVC_bioinfo (Boston) wrote:

Hello,

I have downloaded a reference for aligning RNA-Seq reads to the human transcriptome from this link; specifically, I downloaded the RefSeq transcripts. I was not sure how to get a GTF file for this reference. I posted that question on Biostars a few days ago and was advised to download the GTF from the UCSC Table Browser, so I downloaded it from there.

However, the GTF from the Table Browser has the same value for gene_id and transcript_id, which makes it unsuitable for analysis with HTSeq. So, I have a couple of questions:

  1. What should I do in this case? I do not feel safe editing the GTF file by hand.
  2. Is there another way to get a GTF for this specific reference that is compatible with HTSeq?
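
One way to confirm the problem before deciding how to fix it is to count how many records carry identical gene_id and transcript_id values. The following is only a minimal sketch: the function name `count_same_ids` is hypothetical, and it assumes a tab-separated GTF whose ninth column holds attributes in the usual `key "value";` form.

```shell
# Count GTF records whose gene_id and transcript_id attributes are identical.
# Assumption: columns are tab-separated and column 9 looks like
#   gene_id "X"; transcript_id "Y"; ...
count_same_ids() {
    awk -F'\t' '
        # Skip comment lines; only consider records that have a gene_id.
        $0 !~ /^#/ && match($9, /gene_id "[^"]*"/) {
            gene = substr($9, RSTART + 9, RLENGTH - 10)
            if (match($9, /transcript_id "[^"]*"/)) {
                tx = substr($9, RSTART + 15, RLENGTH - 16)
                total++
                if (gene == tx) same++
            }
        }
        END { printf "%d of %d records have gene_id == transcript_id\n", same, total }
    ' "$1"
}
```

Called as `count_same_ids my_table_browser.gtf` (an illustrative file name), it prints a one-line summary. If nearly every record matches, read counts would be summarised per transcript rather than per gene.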
rna-seq refseq gtf • 1.3k views
19 months ago, Kevin Blighe (Republic of Ireland) wrote:

I would highly recommend the GENCODE GTF, whose information fields contain the gene symbols that you want. I am almost certain that it is compatible with HTSeq.

See here: http://www.gencodegenes.org/releases/current.html

[be sure to download the correct GTF for your genome build (GRCh37/hg19 or GRCh38/hg38)]


Thank you. I am using GRCh38 and followed the link you provided. Can I use "Comprehensive gene annotation", the very first file on that page, when the reference is the human transcriptome (NCBI's RefSeq transcripts)?

written 19 months ago by KVC_bioinfo

Yes, precisely.

Here is the direct link: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/gencode.v27.annotation.gtf.gz

Here is the first record (DDX11L1 is 'always' the first gene, right at the beginning of the short arm of chr1):

chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
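
To illustrate how the attribute column of such a record can be used, here is a minimal sketch that lists gene_id and gene_name for every `gene` record in a GTF. The function name `gene_id_to_name` is hypothetical, and the sketch assumes the file has been decompressed and that its columns are tab-separated, as the GTF format requires.

```shell
# Print a two-column table (gene_id <TAB> gene_name) from the "gene" records of a GTF.
# Assumption: tab-separated columns, attributes in column 9 as key "value"; pairs.
gene_id_to_name() {
    awk -F'\t' '
        $3 == "gene" {
            id = ""; name = ""
            if (match($9, /gene_id "[^"]*"/))   id   = substr($9, RSTART + 9,  RLENGTH - 10)
            if (match($9, /gene_name "[^"]*"/)) name = substr($9, RSTART + 11, RLENGTH - 12)
            if (id != "") print id "\t" name
        }
    ' "$1"
}
```

For example, `gene_id_to_name gencode.v27.annotation.gtf | head` would show the first few identifier and symbol pairs, which can later be joined to a table of per-gene counts.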

written 19 months ago by Kevin Blighe

Thank you very much. I was under the wrong impression that the GTF files for the human genome and the human transcriptome were different.

written 19 months ago by KVC_bioinfo

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 19 months ago by WouterDeCoster

Thanks for the information!

written 19 months ago by KVC_bioinfo

19 months ago, genecats.ucsc wrote:

If you would like to "edit" the GTF file you obtained from the UCSC Table Browser, we provide some utilities for doing so: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

The basic gist is to download your table of interest, chop off some columns (may or may not be necessary depending on the specific table), then run the genePredToGtf utility:

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "select * from refGene" hg19 | \
cut -f2- | genePredToGtf -source=hg19.refGene.ucsc file stdin stdout

Change stdout to the output filename you want in the last command to get an hg19 refGene GTF file:

chr1    hg19.refGene.ucsc   transcript  11869   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357";  gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    11869   12227   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "1"; exon_id "NR_148357.1"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    12613   12721   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "2"; exon_id "NR_148357.2"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   exon    13221   14362   .   +   .   gene_id "LOC102725121"; transcript_id "NR_148357"; exon_number "3"; exon_id "NR_148357.3"; gene_name "LOC102725121";
chr1    hg19.refGene.ucsc   transcript  11874   14409   .   +   .   gene_id "DDX11L1"; transcript_id "NR_046018";  gene_name "DDX11L1";
...

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

  • General questions: genome@soe.ucsc.edu
  • Questions involving private data: genome-www@soe.ucsc.edu
  • Questions involving mirror sites: genome-mirror@soe.ucsc.edu

ChrisL from the UCSC Genome Browser
