Question: What are agi and bgi in the Ensembl GTF file?
Vijay Lakhujani (India) wrote, 13 months ago:

I am trying to use the new tuxedo pipeline for my RNA-seq data.

I have downloaded the Oryza sativa indica GTF file from Ensembl and have pasted a few lines below:

#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2005-01
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1       agi     gene    13717   13879   .       +       .       gene_id "EPlOING00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA";
1       agi     transcript      13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA";
1       agi     exon    13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; exon_number "1"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA"; exon_id "EPlOINE00000043550";
1       bgi     gene    18113   20165   .       +       .       gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1       bgi     transcript      18113   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    18113   19150   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1       bgi     CDS     18113   19150   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     start_codon     18113   18115   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    19344   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1       bgi     CDS     19344   20162   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     stop_codon      20163   20165   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       agi     gene    21086   21198   .       -       .       gene_id "EPlOING00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA";
1       agi     transcript      21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA";
1       agi     exon    21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; exon_number "1"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA"; exon_id "EPlOINE0000000

The number of coding genes (40,745) matches the output of the following command:

awk -F "\t" '$3=="gene"{print }' Oryza_indica.ASM465v1.38.gtf | grep bgi | wc -l

I want to know what bgi and agi in the 2nd column are. Should I keep only the bgi entries? I know that these represent different sources, i.e. bgi is Beijing Genomics. However, keeping both may be an issue.


In the snippet posted above, the entries do not appear to overlap/match. Is that the case for the entire file, and are all agi entries non-coding entities? If the entities are unique then you may need to keep them. Does that suddenly double the gene number?

written 13 months ago by genomax

Yes, entries do not overlap

command (extracting biotype information for agi entries)

awk -F "\t" '$3=="gene"{print $9 }' Oryza_indica.ASM465v1.38.gtf | grep agi | awk -F ";" '{print $4}' | sort | uniq

output

 gene_biotype "antisense"
 gene_biotype "miRNA"
 gene_biotype "misc_RNA"
 gene_biotype "ncRNA"
 gene_biotype "P_RNA"
 gene_biotype "ribozyme"
 gene_biotype "RNase_MRP_RNA"
 gene_biotype "rRNA"
 gene_biotype "snoRNA"
 gene_biotype "snRNA"
 gene_biotype "SRP_RNA"
 gene_biotype "telomerase_RNA"
 gene_biotype "tmRNA"
 gene_biotype "tRNA"
written 13 months ago by Vijay Lakhujani
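
Splitting column 9 on ";" and taking the 4th field only works when every record lists its attributes in the same order; in the snippet above the bgi records carry no gene_name, so their gene_biotype sits in a different position. A minimal sketch that looks the attribute up by name instead, assuming a POSIX awk (the function name biotypes_by_source is made up for illustration):

```shell
# Print the distinct gene_biotype values for gene records from one source.
# Usage: biotypes_by_source <source> <gtf-file>   (hypothetical helper name)
biotypes_by_source() {
    awk -F '\t' -v src="$1" '
        /^#/ { next }                      # skip the #! header lines
        $2 == src && $3 == "gene" {
            # find the gene_biotype attribute by name, wherever it appears
            if (match($9, /gene_biotype "[^"]+"/))
                print substr($9, RSTART, RLENGTH)
        }
    ' "$2" | sort -u
}

# Example call, using the file name from this thread:
# biotypes_by_source agi Oryza_indica.ASM465v1.38.gtf
```

Comparing column 2 directly (rather than piping through grep agi) also avoids matching "agi" elsewhere on the line.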
Emily_Ensembl (EMBL-EBI) wrote, 13 months ago:

The indica rice genome has two sources of annotation, BGI (Beijing Genome Institute) for coding genes and AGI (Arizona Genome Institute) for non-coding.


Thanks Emily_Ensembl

That was helpful. Can you help me understand the stats here?

Non coding genes          48,978
Small non coding genes    43,562
Long non coding genes        240
Misc non coding genes      5,176

The output of the command below does not match the stats on this page for non-coding genes.

command

$ awk -F "\t" '$3=="gene"{print}' Oryza_indica.ASM465v1.38.gtf | grep agi | wc -l
47693
written 13 months ago by Vijay Lakhujani

There are also other sources of ncRNA genes:

  • tRNAs are generated by using tRNAscan
  • Rfam for many types of ncRNAs
  • some from ENA
written 13 months ago by Emily_Ensembl
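
One way to see where the gene records come from is to tally them by source (column 2) and gene_biotype together, instead of grepping for a single source. The sketch below assumes a POSIX awk; the function name genes_by_source_and_biotype is hypothetical. Whether tRNAscan, Rfam or ENA entries show up under their own labels in column 2 of this particular file is not stated in this thread, so treat the output as something to inspect, not as a confirmed breakdown.

```shell
# Count gene records for each (source, gene_biotype) pair.
# Usage: genes_by_source_and_biotype <gtf-file>   (hypothetical helper name)
genes_by_source_and_biotype() {
    awk -F '\t' '
        /^#/ { next }                      # skip the #! header lines
        $3 == "gene" {
            biotype = "unknown"
            if (match($9, /gene_biotype "[^"]+"/)) {
                # drop the 14-character prefix (gene_biotype, space, quote)
                # and the closing quote, keeping only the value
                biotype = substr($9, RSTART + 14, RLENGTH - 15)
            }
            count[$2 "\t" biotype]++
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | sort
}

# Example call, using the file name from this thread:
# genes_by_source_and_biotype Oryza_indica.ASM465v1.38.gtf
```

Each output line is tab-separated: source, biotype, number of gene records.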

Thanks Emily, that was of immense help! Everything is clear now. :D

written 13 months ago by Vijay Lakhujani