Obtaining a GTF with a required set of gene IDs
1
0
Entering edit mode
7.4 years ago
spacer • 0

Hi

I have counts produced by htseq-count. The gene_ids are in the format ENSG##########.#. I need a gtf file with information about all the genes in the htseq-count files which contain 59794 unique ids. None of the Ensembl gtf files that I can find contain this many gene_ids. I tried using UCSC Table Browser but there seems to be no option to download using the ENSG ids,

I would appreciate any help or guidance.

gtf htseq • 2.3k views
ADD COMMENT
0
Entering edit mode
7.4 years ago

A gencode GTF file was used with HTSeq-count, so go to Gencode.

ADD COMMENT
0
Entering edit mode

Unfortunately I do not have that file. I'm trying to reverse engineer someone elses work after a major data loss incident.

ADD REPLY
0
Entering edit mode

Which file? You have the htseq-count file and I told you that a GTF from Gencode was used, so go to the gencode website and try to figure out a reasonable version.

ADD REPLY
0
Entering edit mode

I have tried exactly that. The problem is this. Originally RPKMs were calculated for these counts. Now the researchers want TPM. Once I extract the lengths of the genes from the gtf file I am left with some 19000 gene ids. The gene lengths are calculated by adding the number of nucleotides inside exons, so I guess not all gene_ids in the gtf file have annotation of this kind.

I thought I could reverse engineer the RPKM file to extract the gene lengths. However, it seems that they are unsure of exactly what is in the RPKM file - it might be logs (or whatever). What I know is that the reverse engineering did not work so the numbers in the file are not pure RPKM. Thus I want a way of finding gene lengths for all the gene_ids in the count files so that I can calculate TPM.

ADD REPLY
0
Entering edit mode

All of the gene_ids should have exons. Post a few of the missing IDs for more help.

ADD REPLY
0
Entering edit mode

Below are some of the ids that appear to be missing:

ENSG00000280435.1 ENSG00000280436.1 ENSG00000280438.1 ENSG00000280439.1 ENSG00000280440.1 ENSG00000280441.1 ENSG00000280443.1 ENSG00000280444.1 ENSG00000280445.1 ENSG00000280446.1

Having grep'ed for ENSG00000280436.1 in gencode.v21.gtf, it only finds two exons, but these are shown as "unprocessed_pseudogenes" which is why they are not used for counting lengths.

So, I'll need to ask for some advice. I wrote a program to calculate the genelengths. My program is in java but I looked at a python program, gist.github.com/xiuchengquek/a6ef854eaea997a3566f to see how the genelength are calculated. My program takes minutes to run while the python program takes several hours on the same gtf file but we seem to get the same values for most of the genes. My program also takes the revision number in the gene_id into account while the python program doesn't. Like the python program my program looks for the gtf entry to be an exon and for the "transcript_type" to be "protein_coding". Since I seem to loose a lot IDs in this process, would it be a valid thing to do to also include the "unprocessed_pseudegenes"? I have no idea where and how lengths for the "missing gene ids" in the original analysis were found.

Hope this makes sense. Just ask if I should explain something better.

ADD REPLY
1
Entering edit mode

Don't bother filtering out unprocessed pseudogenes or TEC transcripts. I expect all of your unmatched genes will then go away.

You have a few different ways to go about getting gene lengths. Either the "union gene model" or "median transcript length" are the most common naive methods (i.e., what you can do when starting with counts). I've not bothered with writing a median transcript length version, but code for doing the "union gene model" is available here: https://github.com/dpryan79/Answers/tree/master/SEQanswers_42420

ADD REPLY
0
Entering edit mode

Thanks for this info. At this point it is easier for me to modify my program to include what I need. So I have two questions, if you don't mind. 1) Do the gene_type and transcript_type fields always contain same value? I grep'ed all the types and except for three transcript types that are not found in the gene_types, they are exactly the same. 2) Which types should I include. My program checks the type and then includes it.

I am however, going to familiarise myself with the methods you mentioned and I might amend my program to do both, even if it is just for the fun of coding it.

ADD REPLY
0
Entering edit mode
  1. No
  2. Neither, unless you're just doing this for annotation, in which case I guess the one for gene.
ADD REPLY
0
Entering edit mode

I should add that you if you already have the RPKMs then you can directly convert to TPMs. There's an equation at the bottom of this blog post from Harold Pimentel that shows how.

ADD REPLY
0
Entering edit mode

Now that would have been great if I was actually sure that the so called RPKMs that I had were actually RPKMs (probably FPKMs because the reads were paired end). But as I mentioned, it turned out they weren't pure ?PKMs because something else was done to them so I'm reluctant to do any further transformations on them.

ADD REPLY

Login before adding your answer.

Traffic: 1295 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6