Question: number of genes in human genome
18 months ago by
grant.hovhannisyan1.4k wrote:

Hi Biostars,

To my knowledge, there are around 20k genes in the human genome. But in the Ensembl GTF file, the number of gene_ids is around 60k. Are the other 40k gene_ids pseudogenes, or am I missing something?

Thanks,

ensembl gene_id • 790 views
modified 18 months ago by Emily_Ensembl17k • written 18 months ago by grant.hovhannisyan1.4k

That's a big question, and people have different ideas and opinions on it, so you'll never get a definitive answer. From ENCODE (https://www.encodeproject.org/) and FANTOM5 (http://fantom.gsc.riken.jp/5/), we can say that upward of 200,000 regions of the genome are transcribed into RNA. The majority of these transcripts are non-coding, i.e. they do not code for a protein.

More than 50% of the genome also exhibits some level of homology to other parts of the genome. This includes processed pseudogenes (where just the mRNA sequence is copied elsewhere) and unprocessed pseudogenes (where the entire gene, or parts of it, including introns, is copied elsewhere).

The definition of 'gene' itself needs to be considered. Non-coding RNAs that have proven function are regarded as single-exon genes - it's just that they are never translated into proteins.

The count from the most recent RNA-seq experiment that I did on a few hundred samples was 199,169 different mRNA transcripts.

modified 18 months ago • written 18 months ago by Kevin Blighe39k

Thanks! My confusion was actually about terminology: I didn't realize at the time that the gene_id tag can refer to things other than protein-coding genes. My mistake!

modified 18 months ago • written 18 months ago by grant.hovhannisyan1.4k
18 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

a quick check:

$ curl -s "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz" | gunzip -c | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_biotype | sed 's/gene_biotype//' | sort | uniq -c | sort -rn
  19847   "protein_coding"
  10235   "processed_pseudogene"
   7493   "lincRNA"
   5517   "antisense_RNA"
   2637   "unprocessed_pseudogene"
   2221   "misc_RNA"
   1909   "snRNA"
   1879   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    904   "sense_intronic"
    828   "transcribed_unprocessed_pseudogene"
    549   "rRNA"
    543   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
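
To see how much each broad class contributes to the total, the per-biotype counts above can be collapsed into a few groups. This is only a minimal sketch: `collapse_biotypes` is a hypothetical helper, it assumes its input is the `count "biotype"` lines printed above, and the grouping rules (any biotype containing "pseudogene" counts as a pseudogene, IG/TR gene segments get their own class, everything else is lumped together) are an illustrative choice rather than Ensembl's official classification.

```shell
# Hypothetical helper: collapse `uniq -c` style lines (count, quoted biotype)
# into a few broad classes. The grouping rules are an assumption for illustration.
collapse_biotypes() {
  awk '
    {
      count = $1
      type = $2
      gsub(/"/, "", type)                     # drop the quotes carried over from the GTF
      if (type ~ /pseudogene/)                cls = "pseudogene"
      else if (type == "protein_coding")      cls = "protein_coding"
      else if (type ~ /^(IG|TR)_[A-Z]_gene$/) cls = "IG_TR_gene"
      else                                    cls = "other_mostly_noncoding"
      total[cls] += count
    }
    END {
      # one line per class: summed count, tab, class name
      for (cls in total) printf "%d\t%s\n", total[cls], cls
    }
  ' | sort -rn
}
```

Appending `| collapse_biotypes` to the pipeline above should print one line per class, which makes it easier to see how the non-protein-coding gene_ids split between pseudogenes and other biotypes.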
written 18 months ago by Pierre Lindenbaum118k

Perfect, thank you! Seems like magic :)

written 18 months ago by grant.hovhannisyan1.4k

The current GENCODE annotation (v27, for GRCh38.p10) has:

gunzip -c gencode.v27.annotation.gtf.gz | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_type | sed 's/gene_type//' | sort | uniq -c | sort -rn
  19836   "protein_coding"
  10240   "processed_pseudogene"
   7499   "lincRNA"
   5521   "antisense_RNA"
   2639   "unprocessed_pseudogene"
   2213   "misc_RNA"
   1900   "snRNA"
   1881   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    905   "sense_intronic"
    830   "transcribed_unprocessed_pseudogene"
    544   "rRNA"
    544   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "pseudogene"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
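
The Ensembl and GENCODE commands above differ only in the attribute name (`gene_biotype` versus `gene_type`), so the counting step can be wrapped in one function that takes the key as an argument. A minimal sketch, assuming an uncompressed GTF on standard input; `count_gene_attr` is a hypothetical name, and splitting the attribute column on `;` assumes no attribute value itself contains a semicolon.

```shell
# Hypothetical helper: count the values of one attribute (e.g. gene_biotype
# or gene_type) over the "gene" records of a GTF read from stdin.
count_gene_attr() {
  key=$1
  awk -F '\t' -v key="$key" '
    $3 == "gene" {
      n = split($9, attrs, ";")            # assumes no ";" inside attribute values
      for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)                  # trim leading spaces
        if (index(a, key " ") == 1) {      # attribute starts with the exact key
          val = substr(a, length(key) + 2) # skip the key and the following space
          gsub(/"/, "", val)               # strip the surrounding quotes
          counts[val]++
        }
      }
    }
    END { for (v in counts) printf "%d\t%s\n", counts[v], v }
  ' | sort -rn
}
```

For example, `gunzip -c gencode.v27.annotation.gtf.gz | count_gene_attr gene_type` would be the equivalent of the command above, and `count_gene_attr gene_biotype` would apply to the Ensembl file.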
modified 18 months ago • written 18 months ago by genomax64k
18 months ago by
EMBL-EBI
Emily_Ensembl17k wrote:

On the primary assembly:
20,338 protein-coding genes
22,521 non-coding genes
14,638 pseudogenes

= 56,497 genes on the primary assembly

On the alternate assemblies (haplotypes and patches):
2,750 protein-coding genes
1,288 non-coding genes
1,600 pseudogenes

= 5,038 genes on the alternate assemblies

Total genes = 61,535

Overall, this means that any given person will have ~20k protein-coding genes, plus an even larger number of non-coding genes and pseudogenes. Another person will have a slightly different set of ~20k protein-coding genes.
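
A split like the one above could be approximated from a GTF that includes the alternate sequences. The sketch below is a rough illustration under stated assumptions, not how the figures above were produced: `count_by_assembly` is a hypothetical helper, and it assumes that haplotype and patch sequences can be recognised by a sequence-name prefix (defaulting to `CHR_`), which should be checked against the file actually used.

```shell
# Hypothetical helper: count "gene" records on primary vs alternate sequences.
# Assumption: alternate sequences (haplotypes, patches) have names starting
# with the given prefix; the default "CHR_" is a guess to verify for your file.
count_by_assembly() {
  alt_prefix=${1:-CHR_}
  awk -F '\t' -v p="$alt_prefix" '
    $3 == "gene" {
      if (index($1, p) == 1) alt++      # sequence name starts with the prefix
      else primary++
    }
    END { printf "primary\t%d\nalternate\t%d\n", primary, alt }
  '
}
```

Genes on alternate sequences are mostly additional copies of loci that also exist on the primary assembly, which is why they are usually excluded when quoting a single gene count.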

written 18 months ago by Emily_Ensembl17k
18 months ago by
Sweden
EagleEye6.2k wrote:

There are also categories other than protein-coding genes in the genome (the majority being non-coding RNAs). Here is a summary from a recent version of the human genome annotation.

modified 18 months ago • written 18 months ago by EagleEye6.2k