What are agi and bgi in the Ensembl GTF file?
6.1 years ago

I am trying to use the new tuxedo pipeline for my RNA-seq data.

I have downloaded the Oryza sativa indica GTF file from Ensembl and have pasted a few lines below:

#!genome-build ASM465v1
#!genome-version ASM465v1
#!genome-date 2005-01
#!genome-build-accession GCA_000004655.2
#!genebuild-last-updated 2010-07
1       agi     gene    13717   13879   .       +       .       gene_id "EPlOING00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA";
1       agi     transcript      13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA";
1       agi     exon    13717   13879   .       +       .       gene_id "EPlOING00000043550"; transcript_id "EPlOINT00000043550"; exon_number "1"; gene_name "SNORA23"; gene_source "agi"; gene_biotype "snoRNA"; transcript_name "SNORA23"; transcript_source "agi"; transcript_biotype "snoRNA"; exon_id "EPlOINE00000043550";
1       bgi     gene    18113   20165   .       +       .       gene_id "BGIOSGA002568"; gene_source "bgi"; gene_biotype "protein_coding";
1       bgi     transcript      18113   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    18113   19150   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.1";
1       bgi     CDS     18113   19150   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     start_codon     18113   18115   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "1"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       bgi     exon    19344   20165   .       +       .       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; exon_id "BGIOSGA002568-TA.2";
1       bgi     CDS     19344   20162   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding"; protein_id "BGIOSGA002568-PA"; protein_version "1";
1       bgi     stop_codon      20163   20165   .       +       0       gene_id "BGIOSGA002568"; transcript_id "BGIOSGA002568-TA"; exon_number "2"; gene_source "bgi"; gene_biotype "protein_coding"; transcript_source "bgi"; transcript_biotype "protein_coding";
1       agi     gene    21086   21198   .       -       .       gene_id "EPlOING00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA";
1       agi     transcript      21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA";
1       agi     exon    21086   21198   .       -       .       gene_id "EPlOING00000001909"; transcript_id "EPlOINT00000001909"; exon_number "1"; gene_name "MIR408"; gene_source "agi"; gene_biotype "miRNA"; transcript_name "MIR408"; transcript_source "agi"; transcript_biotype "miRNA"; exon_id "EPlOINE0000000

The number of coding genes (40,745) matches the output of the following command:

awk -F "\t" '$3=="gene"{print }' Oryza_indica.ASM465v1.38.gtf | grep bgi | wc -l
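A stricter variant of the count above, as a minimal sketch: `grep bgi` matches the string anywhere on the line, whereas the function below compares the source column (field 2) exactly. The function name `count_genes_by_source` is made up for illustration, and the sketch assumes a tab-separated GTF like the one quoted.

```shell
# Count "gene" records whose source column (field 2) equals a given value.
# Usage: count_genes_by_source <gtf> <source>
# Hypothetical helper name; not part of any existing tool.
count_genes_by_source() {
    awk -F '\t' -v src="$2" '
        /^#/ { next }                      # skip header lines such as #!genome-build
        $3 == "gene" && $2 == src { n++ }  # exact match on the source column
        END { print n + 0 }                # print 0 rather than an empty line
    ' "$1"
}
```

It would be called as `count_genes_by_source Oryza_indica.ASM465v1.38.gtf bgi`.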

I want to know what bgi and agi mean in the 2nd column. Should I keep only the bgi entries? I know that these represent different sources, i.e. bgi is Beijing Genomics. However, I am concerned that keeping both may be an issue.

bgi agi gtf ensembl RNA-Seq • 1.9k views

In the snippet posted above, the entries do not appear to overlap or match. Is that the case for the entire file, and are all agi entries non-coding? If the entries are unique, then you may need to keep them. Does that suddenly double the gene number?


Yes, entries do not overlap

command (extracting biotype information for agi entries)

awk -F "\t" '$3=="gene"{print $9 }' Oryza_indica.ASM465v1.38.gtf | grep agi | awk -F ";" '{print $4}' | sort | uniq

output

 gene_biotype "antisense"
 gene_biotype "miRNA"
 gene_biotype "misc_RNA"
 gene_biotype "ncRNA"
 gene_biotype "P_RNA"
 gene_biotype "ribozyme"
 gene_biotype "RNase_MRP_RNA"
 gene_biotype "rRNA"
 gene_biotype "snoRNA"
 gene_biotype "snRNA"
 gene_biotype "SRP_RNA"
 gene_biotype "telomerase_RNA"
 gene_biotype "tmRNA"
 gene_biotype "tRNA"
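The command above picks the biotype by position (`$4` after splitting on `;`), which works for records that carry a `gene_name`. Records without one, like the bgi genes in the snippet, have `gene_biotype` in the third position instead. A minimal sketch that looks the attribute up by name follows; the function name `biotypes_for_source` is hypothetical, and it also matches the source on column 2 rather than with `grep`.

```shell
# List distinct gene_biotype values for "gene" records from one source.
# Usage: biotypes_for_source <gtf> <source>
# Hypothetical helper name; not part of any existing tool.
biotypes_for_source() {
    awk -F '\t' -v src="$2" '
        $3 == "gene" && $2 == src {
            # Locate the attribute by name instead of by position.
            if (match($9, /gene_biotype "[^"]*"/))
                print substr($9, RSTART, RLENGTH)
        }
    ' "$1" | sort -u
}
```

It would be called as `biotypes_for_source Oryza_indica.ASM465v1.38.gtf agi`.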
6.1 years ago
Emily 23k

The indica rice genome has two sources of annotation: BGI (Beijing Genomics Institute) for coding genes and AGI (Arizona Genomics Institute) for non-coding genes.


Thanks Emily_Ensembl

That was helpful. Can you help me understand the stats here?

Non coding genes    48,978
Small non coding genes  43,562
Long non coding genes   240
Misc non coding genes   5,176

The output of the command below does not match the stats on this page for non-coding genes.

command

awk -F "\t" '$3=="gene"{print}' Oryza_indica.ASM465v1.38.gtf | grep agi | wc -l

output

47693

There are also other sources of ncRNA genes:

  • tRNAs are generated by using tRNAscan
  • Rfam for many types of ncRNAs
  • some from ENA
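The page reports 48,978 non-coding genes while the agi count above is 47,693, a difference of 1,285. One way to see which source labels actually occur in the file is to tally gene records per value of column 2, as in the sketch below. Whether genes from tRNAscan, Rfam or ENA carry their own label in column 2 of this GTF is not stated in this thread, so treat that as something to check rather than a given. The function name `genes_per_source` is hypothetical.

```shell
# Tally "gene" records per value of the source column (field 2).
# Usage: genes_per_source <gtf>
# Hypothetical helper name; not part of any existing tool.
genes_per_source() {
    awk -F '\t' '
        /^#/ { next }                  # skip header lines
        $3 == "gene" { count[$2]++ }   # one counter per source label
        END { for (s in count) printf "%s\t%d\n", s, count[s] }
    ' "$1" | sort
}
```

It would be called as `genes_per_source Oryza_indica.ASM465v1.38.gtf`, printing one tab-separated line per source with its gene count.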

Thanks Emily, that was of immense help! Everything is clear now. :D
