Question: What are agi and bgi in the Ensembl GTF file?
lakhujanivijay wrote:

I am trying to use the new tuxedo pipeline for my RNA-seq data.

I downloaded the Oryza sativa indica GTF file from Ensembl and have pasted a few lines below:

#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2005-01
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1       agi     gene    13717   13879   .       +       .       gene_id "EPlOING00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA";
1       agi     transcript      13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA";
1       agi     exon    13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; exon_number "1"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA"; exon_id "EPlOINE00000043550";
1       bgi     gene    18113   20165   .       +       .       gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1       bgi     transcript      18113   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    18113   19150   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1       bgi     CDS     18113   19150   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     start_codon     18113   18115   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    19344   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1       bgi     CDS     19344   20162   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     stop_codon      20163   20165   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       agi     gene    21086   21198   .       -       .       gene_id "EPlOING00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA";
1       agi     transcript      21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA";
1       agi     exon    21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; exon_number "1"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA"; exon_id "EPlOINE0000000

The number of coding genes (40,745) matches the output of the following command:

awk -F "\t" '$3=="gene"{print }' Oryza_indica.ASM465v1.38.gtf | grep bgi | wc -l

I want to know what bgi and agi mean in the 2nd column. Should I keep only the bgi entries? I know that these represent different sources, i.e. bgi is Beijing Genomics. However, keeping both may be an issue.
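
The count above greps the whole line, so it would also match `bgi` or `agi` inside the attributes column. A minimal sketch of tallying gene records by the source column (column 2) directly is shown here; the function name `count_genes_by_source` is hypothetical, and the file name in the usage comment is the one from the post.

```shell
# Hypothetical helper: tally "gene" records by the source column (column 2).
count_genes_by_source() {
  awk -F '\t' '
    /^#/ { next }                      # skip "#!genome-build ..." header lines
    $3 == "gene" { n[$2]++ }           # one count per gene record, keyed by source
    END { for (s in n) printf "%s\t%d\n", s, n[s] }
  ' "$1" | sort
}

# Usage (file name taken from the post):
# count_genes_by_source Oryza_indica.ASM465v1.38.gtf
```

Matching on `$2` exactly keeps the tally independent of whatever text appears in column 9.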

ensembl rna-seq bgi gtf agi • 976 views
modified 2.6 years ago by Emily_Ensembl • written 2.6 years ago by lakhujanivijay

In the snippet posted above, the entries do not appear to overlap or match. Is that the case for the entire file, and are all agi entries non-coding? If the entries are unique, then you may need to keep both sets. Does that suddenly double the gene number?

modified 2.6 years ago • written 2.6 years ago by genomax
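
One way to check the overlap question for the whole file, as a rough sketch: sort gene records by start position and flag any gene that begins before an earlier gene from a different source has ended. The function name `count_cross_source_overlaps` is hypothetical, strand is ignored, and the result is the number of gene records with at least one such overlap, not the number of overlapping pairs.

```shell
# Hypothetical helper: count gene records that overlap an earlier gene
# from a different source on the same sequence (strand is ignored).
count_cross_source_overlaps() {
  awk -F '\t' '!/^#/ && $3 == "gene" { print $1 "\t" $4 "\t" $5 "\t" $2 }' "$1" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F '\t' '
      {
        chrom = $1; start = $2 + 0; stop = $3 + 0; src = $4
        hit = 0
        # With genes sorted by start, this gene overlaps an earlier one
        # exactly when that earlier gene ends at or after this start.
        for (s in seen)
          if (s != src && ((chrom, s) in maxend) && maxend[chrom, s] >= start)
            hit = 1
        overlaps += hit
        seen[src] = 1
        # Remember the furthest end seen so far for this sequence and source.
        if (!((chrom, src) in maxend) || stop > maxend[chrom, src])
          maxend[chrom, src] = stop
      }
      END { print overlaps + 0 }
    '
}

# Usage (file name taken from the post):
# count_cross_source_overlaps Oryza_indica.ASM465v1.38.gtf
```

A result of 0 would indicate that no agi gene shares coordinates with a bgi gene under these assumptions.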

Yes, the entries do not overlap.

Command (extracting biotype information for agi entries):

awk -F "\t" '$3=="gene"{print $9 }' Oryza_indica.ASM465v1.38.gtf | grep agi | awk -F ";" '{print $4}' | sort | uniq

output

 gene_biotype "antisense"
 gene_biotype "miRNA"
 gene_biotype "misc_RNA"
 gene_biotype "ncRNA"
 gene_biotype "P_RNA"
 gene_biotype "ribozyme"
 gene_biotype "RNase_MRP_RNA"
 gene_biotype "rRNA"
 gene_biotype "snoRNA"
 gene_biotype "snRNA"
 gene_biotype "SRP_RNA"
 gene_biotype "telomerase_RNA"
 gene_biotype "tmRNA"
 gene_biotype "tRNA"
modified 2.6 years ago • written 2.6 years ago by lakhujanivijay
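
The command above picks the fourth `;`-separated attribute, which is `gene_biotype` only when `gene_name` is present (the bgi gene lines in the snippet have no `gene_name`). A sketch that looks the attribute up by name and tabulates it per source is given here; `biotypes_by_source` is a hypothetical name, and `NA` is an assumed placeholder for records without a `gene_biotype` attribute.

```shell
# Hypothetical helper: count gene records per (source, gene_biotype) pair.
biotypes_by_source() {
  awk -F '\t' '
    /^#/ { next }                      # skip header lines
    $3 == "gene" {
      biotype = "NA"                   # assumed placeholder if the attribute is absent
      # Find gene_biotype wherever it sits in column 9; the prefix
      # gene_biotype " is 14 characters, plus 1 for the closing quote.
      if (match($9, /gene_biotype "[^"]+"/))
        biotype = substr($9, RSTART + 14, RLENGTH - 15)
      n[$2 "\t" biotype]++
    }
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | sort
}

# Usage (file name taken from the post):
# biotypes_by_source Oryza_indica.ASM465v1.38.gtf
```

Each output line is source, biotype, and count, so it shows directly whether any agi record is protein_coding.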
Emily_Ensembl wrote:

The indica rice genome has two sources of annotation: BGI (Beijing Genomics Institute) for coding genes and AGI (Arizona Genomics Institute) for non-coding genes.

written 2.6 years ago by Emily_Ensembl

Thanks Emily_Ensembl

That was helpful. Can you help me understand the stats here

Non coding genes    48,978
Small non coding genes  43,562
Long non coding genes   240
Misc non coding genes   5,176

The output of the command below does not match the stats on this page for non-coding genes.

command

$ awk -F "\t" '$3=="gene"{print}' Oryza_indica.ASM465v1.38.gtf | grep agi | wc -l
47693
modified 2.6 years ago • written 2.6 years ago by lakhujanivijay

There are also other sources of ncRNA genes:

  • tRNAs are generated by using tRNAscan
  • Rfam for many types of ncRNAs
  • some from ENA
written 2.6 years ago by Emily_Ensembl
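
The figures quoted in this thread differ by 48,978 - 47,693 = 1,285 genes. A sketch for seeing how non-coding gene records split across sources is shown here. It assumes the additional sources appear under their own names in column 2, and it treats every gene whose `gene_biotype` is not `protein_coding` as non-coding, which may be broader than the definition used on the stats page. The function name `noncoding_genes_by_source` is hypothetical.

```shell
# Hypothetical helper: count non-protein-coding gene records per source.
noncoding_genes_by_source() {
  awk -F '\t' '
    /^#/ { next }                      # skip header lines
    # Assumption: "non-coding" means any gene not tagged protein_coding.
    $3 == "gene" && $9 !~ /gene_biotype "protein_coding"/ { n[$2]++; total++ }
    END {
      for (s in n) printf "%s\t%d\n", s, n[s]
      printf "total\t%d\n", total
    }
  ' "$1"
}

# Usage (file name taken from the post):
# noncoding_genes_by_source Oryza_indica.ASM465v1.38.gtf
```

Any source other than agi in the output would account for part of the gap between the two counts.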

Thanks Emily, that was of immense help! Everything is clear now. :D

written 2.6 years ago by lakhujanivijay