Question

number of genes in human genome

0

6.6 years ago

grant.hovhannisyan ★ 2.6k

Hi Biostars,

To my knowledge there are around 20k genes in the human genome, but the Ensembl GTF file contains around 60k gene_ids. Are the extra 40k gene_ids pseudogenes, or am I missing something?

Thanks,

gene_id Ensembl • 3.3k views

ADD COMMENT • link updated 6.6 years ago by Emily 23k • written 6.6 years ago by grant.hovhannisyan ★ 2.6k

1

That's a big question and people have different ideas and opinions on it - you'll never get a definitive answer. From ENCODE (https://www.encodeproject.org/) and FANTOM5 (http://fantom.gsc.riken.jp/5/), we can say that upward of 200,000 regions of the genome are transcribed into mRNA. The majority of these do not code for a protein and are non-coding.

More than 50% of the genome also exhibits some level of homology, which includes processed pseudogenes (where just the mRNA sequence is copied elsewhere) and unprocessed pseudogenes (where the entire gene, or parts of it, including introns is copied elsewhere).

The definition of 'gene' itself needs to be considered. Non-coding RNAs that have proven function are regarded as single-exon genes - it's just that they are never translated into proteins.

The count from the most recent RNA-seq experiment that I did on a few hundred samples was 199,169 different mRNA transcripts.

ADD REPLY • link 6.6 years ago by Kevin Blighe 87k

0

Thanks! My confusion was actually about terminology: I didn't realize for the moment that the gene_id tag can refer to things other than protein-coding genes. My mistake!

ADD REPLY • link 6.6 years ago by grant.hovhannisyan ★ 2.6k

5

6.6 years ago

Emily 23k

On the primary assembly:

20,338 protein coding genes
22,521 non-coding genes
14,638 pseudogenes

= 56,497 genes on the primary assembly

On the alternate assemblies (haplotypes and patches):

2,750 protein coding
1,288 non-coding
1,600 pseudogenes

= 5,038 genes on the alternate assemblies

Total genes = 61,535

Overall, this means that any given person will have ~20k coding genes, plus an even larger number of non-coding genes and pseudogenes. Another person will have a slightly different set of ~20k genes.

ADD COMMENT • link 6.6 years ago by Emily 23k

2

6.6 years ago

EagleEye 7.5k

There are also categories other than protein-coding genes (the majority being non-coding RNAs) in the genome. Here is the summary from a recent version of the human genome.

ADD COMMENT • link 6.6 years ago by EagleEye 7.5k

score 4 · Accepted Answer · 2017-09-13

4

6.6 years ago

Pierre Lindenbaum 161k

a quick check:

$ curl -s "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz" | gunzip -c | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_biotype | sed 's/gene_biotype//' | sort | uniq -c | sort -rn
  19847   "protein_coding"
  10235   "processed_pseudogene"
   7493   "lincRNA"
   5517   "antisense_RNA"
   2637   "unprocessed_pseudogene"
   2221   "misc_RNA"
   1909   "snRNA"
   1879   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    904   "sense_intronic"
    828   "transcribed_unprocessed_pseudogene"
    549   "rRNA"
    543   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
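
To relate this per-biotype table back to the original question (protein-coding vs. pseudogenes vs. everything else), the counts can be collapsed into three broad classes. The following is only a rough sketch: the function name `summarize_gene_classes` is made up for illustration, and the grouping rule (exactly `protein_coding`, anything containing `pseudogene`, and the rest as `other`) is a simplifying assumption, not Ensembl's official classification. For example, IG and TR gene segments end up in `other` here.

```shell
# Collapse per-biotype gene counts into three broad classes.
# Reads GTF text on stdin. The class rules are a simplification
# chosen for illustration, not an official Ensembl grouping.
summarize_gene_classes() {
  awk -F '\t' '
    $3 == "gene" {
      # Pull the value of the gene_biotype attribute out of column 9.
      biotype = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        if (attrs[i] ~ /^ *gene_biotype /) {
          biotype = attrs[i]
          sub(/^ *gene_biotype "/, "", biotype)
          sub(/".*$/, "", biotype)
        }
      }
      # Genes without a gene_biotype attribute fall into "other".
      if (biotype == "protein_coding")  cls = "protein_coding"
      else if (biotype ~ /pseudogene/)  cls = "pseudogene"
      else                              cls = "other"
      count[cls]++
    }
    END {
      printf "protein_coding\t%d\n", count["protein_coding"]
      printf "pseudogene\t%d\n", count["pseudogene"]
      printf "other\t%d\n", count["other"]
    }
  '
}
```

With a local copy of the same release-90 file it could be used as `gunzip -c Homo_sapiens.GRCh38.90.gtf.gz | summarize_gene_classes`.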

ADD COMMENT • link 6.6 years ago by Pierre Lindenbaum 161k

1

Current GENCODE annotation has (for GRCh38.p10):

gunzip -c gencode.v27.annotation.gtf.gz | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_type | sed 's/gene_type//' | sort | uniq -c | sort -rn
  19836   "protein_coding"
  10240   "processed_pseudogene"
   7499   "lincRNA"
   5521   "antisense_RNA"
   2639   "unprocessed_pseudogene"
   2213   "misc_RNA"
   1900   "snRNA"
   1881   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    905   "sense_intronic"
    830   "transcribed_unprocessed_pseudogene"
    544   "rRNA"
    544   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "pseudogene"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
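
Note that the two commands above differ only in the attribute name: Ensembl GTFs use `gene_biotype`, while GENCODE GTFs use `gene_type`. A small wrapper that takes the attribute name as an argument can serve both. This is a minimal sketch, with `tally_gene_attribute` being a hypothetical helper name; unlike the one-liners above, it also strips the quotes and leading space from the values.

```shell
# Tally the values of one attribute (for example gene_biotype or gene_type)
# across the "gene" records of a GTF read from stdin.
tally_gene_attribute() {
  key=$1   # attribute name to tally
  awk -F '\t' -v key="$key" '
    $3 == "gene" {
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++) {
        field = attrs[i]
        sub(/^ +/, "", field)                 # drop leading spaces
        if (index(field, key " ") == 1) {     # field starts with the key
          value = substr(field, length(key) + 2)
          gsub(/"/, "", value)                # strip the surrounding quotes
          print value
        }
      }
    }
  ' | sort | uniq -c | sort -rn
}
```

Usage would look like `gunzip -c gencode.v27.annotation.gtf.gz | tally_gene_attribute gene_type` for GENCODE, or the same with `gene_biotype` for an Ensembl file.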