Question: Number of genes in the human genome
8 days ago by
grant.hovhannisyan wrote:

Hi Biostars,

To my knowledge, there are around 20k genes in the human genome. But in the Ensembl GTF file, the number of gene_ids is around 60k. Are the other 40k gene_ids pseudogenes, or am I missing something?

Thanks,


That's a big question, and people have different ideas and opinions on it, so you'll never get a definitive answer. From ENCODE (https://www.encodeproject.org/) and FANTOM5 (http://fantom.gsc.riken.jp/5/), we can say that upward of 200,000 regions of the genome are transcribed into RNA. The majority of these transcripts are non-coding, i.e. they do not code for a protein.

More than 50% of the genome also exhibits some level of homology, which includes processed pseudogenes (where just the mRNA sequence is copied elsewhere) and unprocessed pseudogenes (where the entire gene, or parts of it, including introns, is copied elsewhere).

The definition of 'gene' itself needs to be considered. Non-coding RNAs that have proven function are regarded as single-exon genes - it's just that they are never translated into proteins.

The count from the most recent RNA-seq experiment that I did on a few hundred samples was 199,169 different mRNA transcripts.

written 8 days ago by Kevin Blighe

Thanks! My confusion was actually about terminology: I didn't realize at the time that the gene_id tag can refer to things other than protein-coding genes. My mistake!

written 8 days ago by grant.hovhannisyan
8 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

A quick check, counting the "gene" records in the Ensembl release 90 GTF by gene_biotype:

$ curl -s "ftp://ftp.ensembl.org/pub/release-90/gtf/homo_sapiens/Homo_sapiens.GRCh38.90.gtf.gz" | gunzip -c | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_biotype | sed 's/gene_biotype//' | sort | uniq -c | sort -rn
  19847   "protein_coding"
  10235   "processed_pseudogene"
   7493   "lincRNA"
   5517   "antisense_RNA"
   2637   "unprocessed_pseudogene"
   2221   "misc_RNA"
   1909   "snRNA"
   1879   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    904   "sense_intronic"
    828   "transcribed_unprocessed_pseudogene"
    549   "rRNA"
    543   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
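
The per-biotype listing above can be collapsed into a few broad classes to answer the original question more directly. What follows is only a minimal sketch: it assumes a crude three-way grouping (protein_coding, any biotype containing "pseudogene", everything else) that will not exactly reproduce Ensembl's official categories. The function names `count_gene_classes` and `sample_gtf` and the toy records are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: tally GTF "gene" records by broad biotype class.
# Reads GTF on stdin; $1 is the attribute key (gene_biotype for Ensembl,
# gene_type for GENCODE).
count_gene_classes() {
    key="${1:-gene_biotype}"
    awk -F '\t' -v key="$key" '
        $3 == "gene" && match($9, key " \"[^\"]+\"") {
            # Extract the quoted value that follows the key in column 9.
            val = substr($9, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
            if (val == "protein_coding")  cls = "protein_coding"
            else if (val ~ /pseudogene/)  cls = "pseudogene"
            else                          cls = "other"
            n[cls]++
        }
        END { for (c in n) print n[c] "\t" c }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Toy input (made-up records, not real annotation): four genes, one transcript.
sample_gtf() {
    printf '1\ttoy\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n'
    printf '1\ttoy\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";\n'
    printf '1\ttoy\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_biotype "protein_coding";\n'
    printf '1\ttoy\tgene\t400\t500\t.\t+\t.\tgene_id "G3"; gene_biotype "processed_pseudogene";\n'
    printf '1\ttoy\tgene\t600\t700\t.\t+\t.\tgene_id "G4"; gene_biotype "lincRNA";\n'
}

classes=$(sample_gtf | count_gene_classes)
printf '%s\n' "$classes"
```

On a real file this would presumably be run as `gunzip -c Homo_sapiens.GRCh38.90.gtf.gz | count_gene_classes`, or with `count_gene_classes gene_type` for a GENCODE GTF.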

Perfect, thank you! Seems like magic :)

written 8 days ago by grant.hovhannisyan

The current GENCODE annotation (v27, for GRCh38.p10) has:

gunzip -c gencode.v27.annotation.gtf.gz | awk '($3=="gene")' | cut -f 9 | tr ";" "\n" | grep gene_type | sed 's/gene_type//' | sort | uniq -c | sort -rn
  19836   "protein_coding"
  10240   "processed_pseudogene"
   7499   "lincRNA"
   5521   "antisense_RNA"
   2639   "unprocessed_pseudogene"
   2213   "misc_RNA"
   1900   "snRNA"
   1881   "miRNA"
   1066   "TEC"
    943   "snoRNA"
    905   "sense_intronic"
    830   "transcribed_unprocessed_pseudogene"
    544   "rRNA"
    544   "processed_transcript"
    462   "transcribed_processed_pseudogene"
    189   "sense_overlapping"
    188   "IG_V_pseudogene"
    144   "IG_V_gene"
    111   "transcribed_unitary_pseudogene"
    108   "TR_V_gene"
     95   "unitary_pseudogene"
     79   "TR_J_gene"
     63   "polymorphic_pseudogene"
     49   "scaRNA"
     37   "IG_D_gene"
     31   "3prime_overlapping_ncRNA"
     30   "TR_V_pseudogene"
     22   "Mt_tRNA"
     19   "bidirectional_promoter_lncRNA"
     18   "pseudogene"
     18   "IG_J_gene"
     14   "IG_C_gene"
      9   "IG_C_pseudogene"
      8   "ribozyme"
      6   "TR_C_gene"
      5   "sRNA"
      4   "TR_J_pseudogene"
      4   "TR_D_gene"
      3   "non_coding"
      3   "IG_J_pseudogene"
      2   "translated_processed_pseudogene"
      2   "Mt_rRNA"
      1   "vaultRNA"
      1   "scRNA"
      1   "macro_lncRNA"
      1   "IG_pseudogene"
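
The Ensembl release 90 and GENCODE v27 listings in this thread differ only slightly. Below is a minimal sketch for lining up two such `uniq -c` outputs side by side. `compare_counts` is a hypothetical helper name, and the sample rows are copied from the two listings above.

```shell
#!/bin/sh
# Hypothetical helper: compare two `uniq -c` listings (count, then quoted biotype).
# Prints: biotype, count in first file, count in second file, difference.
# Only biotypes present in both listings are reported.
compare_counts() {
    tab=$(printf '\t')
    a=$(mktemp) && b=$(mktemp) || return 1
    # Reorder each listing to "biotype<TAB>count" and sort on biotype for join.
    awk '{ gsub(/"/, "", $2); print $2 "\t" $1 }' "$1" | LC_ALL=C sort -t "$tab" -k1,1 > "$a"
    awk '{ gsub(/"/, "", $2); print $2 "\t" $1 }' "$2" | LC_ALL=C sort -t "$tab" -k1,1 > "$b"
    LC_ALL=C join -t "$tab" "$a" "$b" |
        awk -F '\t' '{ print $1 "\t" $2 "\t" $3 "\t" ($3 - $2) }'
    rm -f "$a" "$b"
}

# A few rows from the Ensembl (first) and GENCODE (second) listings in this thread.
ens=$(mktemp)
gen=$(mktemp)
printf '%s\n' '  19847   "protein_coding"' '  10235   "processed_pseudogene"' '     22   "pseudogene"' > "$ens"
printf '%s\n' '  19836   "protein_coding"' '  10240   "processed_pseudogene"' '     18   "pseudogene"' > "$gen"

result=$(compare_counts "$ens" "$gen")
printf '%s\n' "$result"
rm -f "$ens" "$gen"
```

With the full listings saved to two files, the same call would show every biotype whose count changed between the two annotations.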
written 8 days ago by genomax
8 days ago by
EMBL-EBI
Emily_Ensembl wrote:

On the primary assembly: 20,338 protein-coding genes, 22,521 non-coding genes and 14,638 pseudogenes

= 57,497 genes on the primary assembly

On the alternate assemblies (haplotypes and patches): 2,750 protein-coding genes, 1,288 non-coding genes and 1,600 pseudogenes

= 5,638 genes on the alternate assemblies

Total genes = 63,135

Overall, this means that any given person will have ~20k protein-coding genes, plus a larger number of non-coding genes and pseudogenes. Another person will have a slightly different set of ~20k coding genes.

8 days ago by
Sweden
EagleEye wrote:

There are also categories other than protein-coding genes in the genome (the majority being non-coding RNAs). Here is the summary from a recent version of the human genome.
