Question: Obtaining a GTF with a required set of gene IDs
0
gravatar for spacer
2.9 years ago by
spacer0
spacer0 wrote:

Hi

I have counts produced by htseq-count. The gene_ids are in the format ENSG##########.#. I need a gtf file with information about all the genes in the htseq-count files which contain 59794 unique ids. None of the Ensembl gtf files that I can find contain this many gene_ids. I tried using UCSC Table Browser but there seems to be no option to download using the ENSG ids,

I would appreciate any help or guidance.

htseq gtf • 1.0k views
ADD COMMENTlink modified 2.9 years ago by Devon Ryan92k • written 2.9 years ago by spacer0
0
gravatar for Devon Ryan
2.9 years ago by
Devon Ryan92k
Freiburg, Germany
Devon Ryan92k wrote:

A gencode GTF file was used with HTSeq-count, so go to Gencode.

ADD COMMENTlink written 2.9 years ago by Devon Ryan92k

Unfortunately I do not have that file. I'm trying to reverse engineer someone elses work after a major data loss incident.

ADD REPLYlink written 2.9 years ago by spacer0

Which file? You have the htseq-count file and I told you that a GTF from Gencode was used, so go to the gencode website and try to figure out a reasonable version.

ADD REPLYlink written 2.9 years ago by Devon Ryan92k

I have tried exactly that. The problem is this. Originally RPKMs were calculated for these counts. Now the researchers want TPM. Once I extract the lengths of the genes from the gtf file I am left with some 19000 gene ids. The gene lengths are calculated by adding the number of nucleotides inside exons, so I guess not all gene_ids in the gtf file have annotation of this kind.

I thought I could reverse engineer the RPKM file to extract the gene lengths. However, it seems that they are unsure of exactly what is in the RPKM file - it might be logs (or whatever). What I know is that the reverse engineering did not work so the numbers in the file are not pure RPKM. Thus I want a way of finding gene lengths for all the gene_ids in the count files so that I can calculate TPM.

ADD REPLYlink written 2.9 years ago by spacer0

All of the gene_ids should have exons. Post a few of the missing IDs for more help.

ADD REPLYlink written 2.9 years ago by Devon Ryan92k

Below are some of the ids that appear to be missing:

ENSG00000280435.1 ENSG00000280436.1 ENSG00000280438.1 ENSG00000280439.1 ENSG00000280440.1 ENSG00000280441.1 ENSG00000280443.1 ENSG00000280444.1 ENSG00000280445.1 ENSG00000280446.1

Having grep'ed for ENSG00000280436.1 in gencode.v21.gtf, it only finds two exons, but these are shown as "unprocessed_pseudogenes" which is why they are not used for counting lengths.

So, I'll need to ask for some advice. I wrote a program to calculate the genelengths. My program is in java but I looked at a python program, gist.github.com/xiuchengquek/a6ef854eaea997a3566f to see how the genelength are calculated. My program takes minutes to run while the python program takes several hours on the same gtf file but we seem to get the same values for most of the genes. My program also takes the revision number in the gene_id into account while the python program doesn't. Like the python program my program looks for the gtf entry to be an exon and for the "transcript_type" to be "protein_coding". Since I seem to loose a lot IDs in this process, would it be a valid thing to do to also include the "unprocessed_pseudegenes"? I have no idea where and how lengths for the "missing gene ids" in the original analysis were found.

Hope this makes sense. Just ask if I should explain something better.

ADD REPLYlink modified 2.9 years ago • written 2.9 years ago by spacer0
1

Don't bother filtering out unprocessed pseudogenes or TEC transcripts. I expect all of your unmatched genes will then go away.

You have a few different ways to go about getting gene lengths. Either the "union gene model" or "median transcript length" are the most common naive methods (i.e., what you can do when starting with counts). I've not bothered with writing a median transcript length version, but code for doing the "union gene model" is available here: https://github.com/dpryan79/Answers/tree/master/SEQanswers_42420

ADD REPLYlink written 2.9 years ago by Devon Ryan92k

Thanks for this info. At this point it is easier for me to modify my program to include what I need. So I have two questions, if you don't mind. 1) Do the gene_type and transcript_type fields always contain same value? I grep'ed all the types and except for three transcript types that are not found in the gene_types, they are exactly the same. 2) Which types should I include. My program checks the type and then includes it.

I am however, going to familiarise myself with the methods you mentioned and I might amend my program to do both, even if it is just for the fun of coding it.

ADD REPLYlink written 2.9 years ago by spacer0
  1. No
  2. Neither, unless you're just doing this for annotation, in which case I guess the one for gene.
ADD REPLYlink written 2.9 years ago by Devon Ryan92k

I should add that you if you already have the RPKMs then you can directly convert to TPMs. There's an equation at the bottom of this blog post from Harold Pimentel that shows how.

ADD REPLYlink written 2.9 years ago by Devon Ryan92k

Now that would have been great if I was actually sure that the so called RPKMs that I had were actually RPKMs (probably FPKMs because the reads were paired end). But as I mentioned, it turned out they weren't pure ?PKMs because something else was done to them so I'm reluctant to do any further transformations on them.

ADD REPLYlink written 2.9 years ago by spacer0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1122 users visited in the last hour