Number of genes in the human genome
6.6 years ago

Hi Biostars,

To my knowledge there are around 20k genes in the human genome, but the number of gene_ids in the Ensembl GTF file is around 60k. Are the other 40k gene_ids pseudogenes, or am I missing something?

Thanks,

gene_id Ensembl • 3.3k views

That's a big question, and people have different ideas and opinions on it, so you'll never get a definitive answer. From ENCODE (https://www.encodeproject.org/) and FANTOM5 (http://fantom.gsc.riken.jp/5/), we can say that upward of 200,000 regions of the genome are transcribed into RNA. The majority of these transcripts do not code for a protein.

More than 50% of the genome also exhibits some level of homology, which includes processed pseudogenes (where just the mRNA sequence is copied elsewhere) and unprocessed pseudogenes (where the entire gene, or part of it, including introns, is copied elsewhere).

The definition of 'gene' itself needs to be considered. Non-coding RNAs that have proven function are regarded as single-exon genes - it's just that they are never translated into proteins.

The count from the most recent RNA-seq experiment that I did on a few hundred samples was 199,169 different mRNA transcripts.


Thanks! My confusion was actually about terminology: I didn't realize for a moment that the gene_id tag can refer to things other than protein-coding genes. My mistake!

6.6 years ago

A quick check of the gene biotypes in the Ensembl release 90 GTF:

$ curl -s "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz" | gunzip -c | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_biotype | sed 's/gene_biotype//' | sort | uniq -c | sort -rn
  19847   "protein_coding"
  10235   "processed_pseudogene"
   7493   "lincRNA"
   5517   "antisense_RNA"
   2637   "unprocessed_pseudogene"
   2221   "misc_RNA"
   1909   "snRNA"
   1879   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    904   "sense_intronic"
    828   "transcribed_unprocessed_pseudogene"
    549   "rRNA"
    543   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
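
The same tally can also be done in a single awk pass that filters the gene records and parses the attribute column directly. This is only a minimal sketch, assuming a tab-separated GTF on stdin whose ninth column holds `key "value";` pairs; the function name `count_biotypes` and the toy records are made up for illustration and are not real annotation.

```shell
# count_biotypes: hypothetical helper, reads a GTF on stdin and prints
# "count<TAB>biotype" lines, most frequent first.
# $1 = attribute key to tally (default: gene_biotype, the Ensembl name;
#      GENCODE files use gene_type instead).
count_biotypes() {
  local key="${1:-gene_biotype}"
  awk -F '\t' -v key="$key" '
    $3 == "gene" {
      # column 9 holds the attributes, separated by ";"
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)                  # trim leading blanks
        if (index(a, key " ") == 1) {      # attribute starts with the key
          v = substr(a, length(key) + 2)   # text after "key "
          gsub(/"/, "", v)                 # drop the quotes
          count[v]++
        }
      }
    }
    END { for (b in count) printf "%d\t%s\n", count[b], b }
  ' | sort -k1,1nr -k2,2
}

# Toy input (made-up records) to show the output shape:
# prints "2<TAB>protein_coding" then "1<TAB>lincRNA".
printf '1\thavana\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n1\thavana\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n1\thavana\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";\n1\thavana\tgene\t400\t500\t.\t+\t.\tgene_id "G3"; gene_biotype "protein_coding";\n' | count_biotypes
```

With a real file this would be called as `gunzip -c file.gtf.gz | count_biotypes`, or `count_biotypes gene_type` for a GENCODE file.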

The current GENCODE annotation (v27, for GRCh38.p10) has:

gunzip -c gencode.v27.annotation.gtf.gz | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_type | sed 's/gene_type//' | sort | uniq -c | sort -rn
  19836   "protein_coding"
  10240   "processed_pseudogene"
   7499   "lincRNA"
   5521   "antisense_RNA"
   2639   "unprocessed_pseudogene"
   2213   "misc_RNA"
   1900   "snRNA"
   1881   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    905   "sense_intronic"
    830   "transcribed_unprocessed_pseudogene"
    544   "rRNA"
    544   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "pseudogene"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
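
To get closer to the original question (how many of the gene_ids are pseudogenes), a per-biotype listing like the one above can be collapsed into a few coarse classes. The sketch below assumes input lines in the `uniq -c` format shown above (a count followed by a quoted biotype). The helper name `summarize_classes` is hypothetical, and the grouping rule (anything ending in `pseudogene` counts as a pseudogene, `protein_coding` stands alone, everything else is "other") is a simplification, not Ensembl's or GENCODE's official classification.

```shell
# summarize_classes: hypothetical helper, reads `uniq -c` style lines
# (count, then quoted biotype) on stdin and prints "class<TAB>total".
summarize_classes() {
  awk '
    {
      b = $2
      gsub(/"/, "", b)                               # drop the quotes
      if (b ~ /pseudogene$/)          class = "pseudogene"
      else if (b == "protein_coding") class = "protein_coding"
      else                            class = "other"
      total[class] += $1
    }
    END { for (c in total) printf "%s\t%d\n", c, total[c] }
  ' | sort
}

# Four lines copied from the GENCODE listing above, as a small demonstration.
printf '  19836   "protein_coding"\n  10240   "processed_pseudogene"\n   7499   "lincRNA"\n   2639   "unprocessed_pseudogene"\n' | summarize_classes
```

Piping the full listing through the function gives the totals for the whole annotation under this simplified grouping.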

Perfect, thank you! Seems like magic :)

6.6 years ago
Emily 23k

On the primary assembly:
20,338 protein-coding genes
22,521 non-coding genes
14,638 pseudogenes

= 56497 genes on the primary assembly

On the alternate assemblies (haplotypes and patches):
2,750 protein-coding genes
1,288 non-coding genes
1,600 pseudogenes

= 5038 genes on the alternate assemblies.

Total genes = 61535
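
As a quick sanity check, the category counts quoted above can be summed with shell arithmetic. The variable names are only illustrative. Note that the sums come out 1,000 and 600 higher than the subtotals stated above, so at least one of the quoted figures probably contains a typo; the snippet cannot say which one.

```shell
# Sum the per-category counts exactly as quoted in this answer.
primary=$((20338 + 22521 + 14638))     # coding + non-coding + pseudogenes
alternate=$((2750 + 1288 + 1600))      # same categories, alternate assemblies
total=$((primary + alternate))

echo "primary=$primary alternate=$alternate total=$total"
# prints: primary=57497 alternate=5638 total=63135
```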

Overall, this means that any given person will have ~20k coding genes, plus a larger number of non-coding genes and pseudogenes. Another person will have a slightly different set of ~20k coding genes.

6.6 years ago
EagleEye 7.5k

The genome also contains categories of genes other than protein-coding genes, the majority of them non-coding RNAs. Here is the summary from a recent version of the human genome annotation.
