How to separate protein-coding and non-coding genes in a GTF file
3.0 years ago
Hi,

In a GTF file I see a "gene_type" attribute with many different values, listed below. Which of these count as non-coding, which as protein-coding, and which as lncRNA?

3prime_overlapping_ncRNA
IG_C_gene
IG_C_pseudogene
IG_D_gene
IG_J_gene
IG_J_pseudogene
IG_V_gene
IG_V_pseudogene
IG_pseudogene
Mt_rRNA
Mt_tRNA
TEC
TR_C_gene
TR_D_gene
TR_J_gene
TR_J_pseudogene
TR_V_gene
TR_V_pseudogene
antisense_RNA
bidirectional_promoter_lncRNA
lincRNA
macro_lncRNA
miRNA
misc_RNA
non_coding
polymorphic_pseudogene
processed_pseudogene
processed_transcript
protein_coding
pseudogene
rRNA
ribozyme
sRNA
scRNA
scaRNA
sense_intronic
sense_overlapping
snRNA
snoRNA
transcribed_processed_pseudogene
transcribed_unitary_pseudogene
transcribed_unprocessed_pseudogene
translated_processed_pseudogene
unitary_pseudogene
unprocessed_pseudogene
vaultRNA


I see the gene_type protein_coding. Is that the only type I should treat as protein-coding, or should I also include other gene_type values? Which ones count as non-coding, and which as lncRNA?

If you are only interested in protein-coding genes, then yes, you can simply keep the entries whose gene_type is protein_coding.
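As a rough sketch of that filter, something like the following should work. It assumes a GENCODE-style GTF where the ninth column holds attributes such as gene_type "protein_coding"; (Ensembl's own GTF files call the same attribute gene_biotype, so the pattern would need adjusting there). The function names and the sample lines are made up for illustration.

```python
import re

# Matches the gene_type attribute in the 9th (attributes) column of a GTF line.
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def gene_type_of(gtf_line):
    """Return the gene_type value of one GTF line, or None if it has none."""
    if gtf_line.startswith("#"):  # header / comment line
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:  # not a complete GTF record
        return None
    match = GENE_TYPE_RE.search(fields[8])
    return match.group(1) if match else None


def filter_by_gene_type(lines, wanted=("protein_coding",)):
    """Yield only the GTF lines whose gene_type is one of `wanted`."""
    wanted = set(wanted)
    for line in lines:
        if gene_type_of(line) in wanted:
            yield line


# Hypothetical records, only to show the behaviour.
SAMPLE_GTF = [
    "##description: hypothetical example\n",
    'chr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding"; gene_name "A";\n',
    'chr1\tHAVANA\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_type "lincRNA"; gene_name "B";\n',
]

protein_coding_lines = list(filter_by_gene_type(SAMPLE_GTF))
```

For a real file you would pass an open file handle instead of SAMPLE_GTF and write the yielded lines to a new file. Note that comment lines are dropped by this filter, so copy the header separately if you need it.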

Here's an explanation of the different biotypes found in Ensembl, as suggested by i.sudbery: https://www.gencodegenes.org/gencode_biotypes.html

I might be wrong, but I think VEGA is now retired, and the up-to-date reference for the biotypes found in recent (GENCODE-based) Ensembl builds is https://www.gencodegenes.org/gencode_biotypes.html
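For the lncRNA / non-coding part of the question, here is one possible way to bucket the gene_type values into broad classes. This is only my reading of the GENCODE biotypes page, not an official mapping: the lncRNA set is the group of legacy biotypes that, as far as I understand, GENCODE later merged into a single "lncRNA" biotype, and the class names and the function name are my own. Please check the sets against the page before relying on them.

```python
# Legacy biotypes that (to my understanding) GENCODE folds into "lncRNA".
# Both "antisense" and "antisense_RNA" are listed because the spelling
# differs between releases; this is an assumption, check your own file.
LNCRNA_TYPES = {
    "lncRNA", "3prime_overlapping_ncRNA", "antisense", "antisense_RNA",
    "bidirectional_promoter_lncRNA", "lincRNA", "macro_lncRNA",
    "non_coding", "processed_transcript", "sense_intronic",
    "sense_overlapping",
}

# Short / structural non-coding RNAs from the list in the question.
SMALL_NCRNA_TYPES = {
    "miRNA", "misc_RNA", "Mt_rRNA", "Mt_tRNA", "rRNA", "ribozyme",
    "sRNA", "scRNA", "scaRNA", "snRNA", "snoRNA", "vaultRNA",
}


def broad_class(gene_type):
    """Map a gene_type value to a broad, illustrative class name."""
    if gene_type == "protein_coding":
        return "protein_coding"
    # IG_*_gene and TR_*_gene are immunoglobulin / T-cell receptor segments;
    # they are kept separate here so you can decide whether to count them
    # as coding.
    if gene_type.startswith(("IG_", "TR_")) and gene_type.endswith("_gene"):
        return "IG_TR_gene"
    # Catches every *_pseudogene value, including polymorphic_pseudogene,
    # which some groupings treat as coding instead.
    if "pseudogene" in gene_type:
        return "pseudogene"
    if gene_type in LNCRNA_TYPES:
        return "lncRNA"
    if gene_type in SMALL_NCRNA_TYPES:
        return "small_ncRNA"
    return "other"  # e.g. TEC (to be experimentally confirmed)
```

With this, "non-coding" would be the union of the lncRNA and small_ncRNA classes. Whether pseudogenes and TEC belong there depends on your analysis.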

After checking the Ensembl website, you are right. I have edited my answer.