Question: What does Ensembl/GENCODE gene_type "processed_transcript" really mean?
I am using gencode.v19.annotation.gtf (head of the GTF below) to assign gene_types to the transcripts in my study via Ensembl gene IDs. An example line from the GTF is also below.

Some gene_types have the name processed_transcript, while others are lincRNA, antisense, etc.

Ensembl just lists the processed_transcript "biotype" under long non-coding transcripts. That makes sense, given that I understand processed transcripts to be those that do not have an ORF.

But what is unclear to me is the difference between a processed_transcript and these other long non-coding transcripts. According to Vega, processed_transcript sits above these other long ncRNAs in a hierarchy, which makes sense, except that I see many transcripts with this annotation and not one of the subtypes like lincRNA. Why would that be?

Based on what GENCODE has written about biotypes, I guess something would be processed_transcript if it has no ORF and does not meet the criteria for other categories like lincRNA or antisense. Does anyone know if this is true?


##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74) 
##provider: GENCODE
##format: gtf
##date: 2013-12-05

Example line:

chr1    HAVANA  gene    11869   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
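
For anyone wanting to pull gene_type out of lines like the one above, here is a minimal sketch in Python. It assumes the file is tab-separated with the attributes in the ninth column, as in the GTF format; the names `parse_gtf_line` and `strip_version` are just illustrative helpers, not part of any library. Keys that occur more than once on a line would overwrite each other in the dictionary.

```python
import re

# Matches `key "value";` and unquoted `key value;` pairs (e.g. `level 2;`)
ATTR_RE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^;\s]+))\s*;')

def parse_gtf_line(line):
    """Split one GTF record into its feature type and an attribute dict."""
    fields = line.rstrip("\n").split("\t")
    attrs = {}
    for key, quoted, bare in ATTR_RE.findall(fields[8]):
        # Exactly one of `quoted`/`bare` is non-empty for a normal attribute
        attrs[key] = quoted if quoted else bare
    return fields[2], attrs

def strip_version(ensembl_id):
    """ENSG00000223972.4 -> ENSG00000223972 (drop the version suffix)."""
    return ensembl_id.split(".")[0]

# The example line from the post, rebuilt with explicit tabs
example = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14412", ".", "+", ".",
    'gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; '
    'gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; '
    'transcript_type "pseudogene"; transcript_status "KNOWN"; '
    'transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";',
])

feature, attrs = parse_gtf_line(example)
print(feature, strip_version(attrs["gene_id"]), attrs["gene_type"])
```

Stripping the version suffix is only needed if the IDs in your own study are unversioned.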
Written 2.4 years ago by james.lloyd; modified 2.4 years ago by i.sudbery

For pseudogenes, the 'processed' vs 'nonprocessed' categories are well defined in the Wikipedia link. Processed pseudogenes arise by retrotransposition and have a CDS-like gene structure without introns.

Written 2.4 years ago by microfuge; modified 2.4 years ago

Thanks, but these are not pseudogenes. Pseudogenes have a different gene_type in the databases, and I understand that distinction.

Written 2.4 years ago by james.lloyd
I think your assumption is right: "processed_transcript if it has no ORF and does not meet the criteria for other categories like lincRNA or antisense".

The processed_transcript category would have been in use long before things now called antisense and lncRNA were annotated. There are more data and guidelines for annotating the latter now, so they no longer fall into the broad category of processed_transcript (i.e. transcripts that are 'processed' by the cell: they are spliced and can have a polyA tail added to them).

Everything that does not get classified as lncRNA or ncRNA will be tagged as processed_transcript (i.e. the 'unclassified' category in the VEGA help page).

Written 2.4 years ago by Denise - Open Targets
My understanding was that "processed_transcripts" are non-coding transcripts associated with a gene that does have a coding isoform. For a transcript to be a lincRNA, it needs to be part of an entirely non-coding gene.

Written 2.4 years ago by i.sudbery

I think your understanding is right, @i.sudbery. Perhaps my answer was a bit confusing. processed_transcript can be used either as a gene type or as a transcript type. One can indeed have a processed transcript in a locus that is coding. That's really common. In cases like that, the gene type will be protein_coding (not processed_transcript) and the non-coding transcript will be processed_transcript. One can also have a locus where both the gene type and the transcript type are processed_transcript (perhaps not too common), e.g. AC005614.5. The lncRNA is a transcript in a gene whose gene type is classified by GENCODE as 'processed_transcript'. There should not be a lncRNA in a gene that is coding. I always find it useful to look at those tricky cases using the browser, and BioMart can help when trying to find these examples (just search for processed_transcript in the FILTERS under gene type or transcript type).
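
If you would rather check this in the GTF itself than in BioMart, a rough sketch is to count (gene_type, transcript_type) pairs over the transcript records. The function name `biotype_pairs`, the helper `toy_line`, and the TOY1/TOY2 records below are made up for illustration and are not real annotation; the sketch assumes tab-separated GTF lines with quoted gene_type and transcript_type attributes, as in the GENCODE v19 file.

```python
from collections import Counter
import re

# Captures quoted `key "value";` attributes only, which is enough here
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')

def biotype_pairs(lines):
    """Count (gene_type, transcript_type) pairs over transcript records."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip the ## header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript-level records are counted
        attrs = dict(ATTR_RE.findall(fields[8]))
        counts[(attrs.get("gene_type"), attrs.get("transcript_type"))] += 1
    return counts

def toy_line(feature, gene_id, gene_type, transcript_type):
    """Build a made-up GTF record (hypothetical IDs, for illustration)."""
    attrs = (f'gene_id "{gene_id}"; gene_type "{gene_type}"; '
             f'transcript_type "{transcript_type}";')
    return "\t".join(["chr1", "HAVANA", feature, "1", "100",
                      ".", "+", ".", attrs])

toy = [
    "##format: gtf",
    toy_line("gene", "TOY1", "protein_coding", "protein_coding"),
    toy_line("transcript", "TOY1", "protein_coding", "protein_coding"),
    toy_line("transcript", "TOY1", "protein_coding", "processed_transcript"),
    toy_line("transcript", "TOY2", "processed_transcript", "processed_transcript"),
    toy_line("transcript", "TOY2", "processed_transcript", "lincRNA"),
]

counts = biotype_pairs(toy)
for (gene_type, transcript_type), n in sorted(counts.items()):
    print(gene_type, transcript_type, n)
```

On a real file you would pass the open file handle for gencode.v19.annotation.gtf instead of the toy list; the pair (protein_coding, processed_transcript) then corresponds to the common case described above.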

Written 2.4 years ago by Denise - Open Targets

It is a little confusing that a gene biotype can contain the word "transcript" and contain multiple transcripts whose biotype is not "processed_transcript".

processed_transcript implies that it refers to a transcript, and a single transcript at that.

Written 2.4 years ago by i.sudbery
