Question: Creating a TxDb (transcript database) from a GFF3 file?
lokraj2003 wrote (6 months ago):

I am trying to create a TxDb (transcript database) using the GenomicFeatures package. I downloaded the GFF3 file from NCBI.

Code:

orf <- GenomicFeatures::makeTxDbFromGFF("orf.gff3", format = "auto")

I get the following output:

orf
TxDb object:
Db type: TxDb
Supporting package: GenomicFeatures
Data source: mouse.gff3
Organism: NA
Taxonomy ID: NA
miRBase build ID: NA
Genome: NA
transcript_nrow: 0
exon_nrow: 0
cds_nrow: 0
Db created by: GenomicFeatures package from Bioconductor
Creation time: 2019-05-29 22:32:09 -0500 (Wed, 29 May 2019)
GenomicFeatures version at creation time: 1.32.2
RSQLite version at creation time: 2.1.1
DBSCHEMAVERSION: 1.2

Link to the genome: https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1

As you can see, the database contains no transcripts, exons, or CDS. Can anyone help with this, please?

SMK wrote (6 months ago):

Hi lokraj2003,

The GFF3 file that you downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1 appears to have CDS lines with no parent gene features, which is why the TxDb comes out empty. You can add gene features to the file, then reload it using the same function.

For example (any method of reformatting the original GFF3 will do; here I use awk to create a gene line for each CDS and add a Parent attribute to each CDS, and I leave the finer parsing details to you):

$ awk 'BEGIN{FS=OFS="\t"} $3!="CDS"{print $0} $3=="CDS"{GENE=$0; gsub("\t0\t", "\t.\t", GENE); gsub("CDS", "gene", GENE); gsub("cds", "gene", GENE); gsub(";product=.*", "", GENE); print GENE; ID=$9; gsub(".*;protein_id=", "", ID); print $0 ";Parent=gene-" ID}' orf.gff3 \
  > orf_re.gff3

$ head orf_re.gff3
##sequence-region AY386263.1 1 137241
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10258
AY386263.1  Genbank region  1   137241  .   +   .   ID=AY386263.1:1..137241;Dbxref=taxon:10258;country=USA: Iowa;gbkey=Src;genome=genomic;isolate=ORFA;isolation-source=nasal secretions of a lamb at the Iowa Ram Test Station during an outbreak in 1982%2C then passaged in ovine fetal turbinate cells;mol_type=genomic DNA;strain=OV-IA82
AY386263.1  Genbank gene    2409    2858    .   -   .   ID=gene-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=gene
AY386263.1  Genbank CDS 2409    2858    .   -   0   ID=cds-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=CDS;product=ORF001 hypothetical protein;protein_id=AAR98099.1;Parent=gene-AAR98099.1
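
Before loading the reformatted file into R, it may be worth confirming that every CDS line really points at a gene line. The following is a minimal sketch, not part of the original answer: `check_cds_parents` is a hypothetical helper name, and it assumes the file is tab-separated with `ID=` and `Parent=` attributes in column 9, as in the output above. The demo uses a made-up two-line sample rather than the real data.

```shell
#!/bin/sh
# check_cds_parents FILE
# Hypothetical helper (not from the thread): counts CDS lines whose Parent=
# value does not match the ID= of any gene line in the same GFF3 file.
check_cds_parents() {
  awk -F'\t' '
    # Return the value of attribute "key" from a GFF3 column-9 string.
    function attr(s, key,    n, i, parts) {
      n = split(s, parts, ";")
      for (i = 1; i <= n; i++)
        if (index(parts[i], key "=") == 1)
          return substr(parts[i], length(key) + 2)
      return ""
    }
    /^#/ { next }                          # skip directives and comments
    $3 == "gene" { id = attr($9, "ID"); if (id != "") genes[id] = 1 }
    $3 == "CDS"  { cds_parent[NR] = attr($9, "Parent") }
    END {
      bad = 0
      for (n in cds_parent)
        if (!(cds_parent[n] in genes)) bad++
      print bad " CDS line(s) without a matching gene"
      exit (bad > 0)                       # non-zero status if any are orphaned
    }
  ' "$1"
}

# Demo on a two-line sample (made-up accession P1).
sample=$(mktemp)
printf 'X\tGenbank\tgene\t1\t9\t.\t+\t.\tID=gene-P1\n' > "$sample"
printf 'X\tGenbank\tCDS\t1\t9\t.\t+\t0\tID=cds-P1;Parent=gene-P1\n' >> "$sample"
check_cds_parents "$sample"
rm -f "$sample"
```

On the real data this would be run as `check_cds_parents orf_re.gff3`. Running it on the untouched `orf.gff3` should, if the diagnosis above is right, report every CDS as lacking a matching gene.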

And using that gff3 you'll get:

> orf <- GenomicFeatures::makeTxDbFromGFF("orf_re.gff3", format = "auto")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> orf
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: orf_re.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 130
# exon_nrow: 130
# cds_nrow: 130
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-30 16:57:03 +0200 (Thu, 30 May 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2
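
As a rough cross-check on the counts in that summary, the feature types in column 3 of the GFF3 can be tallied directly. This is only a sketch: `count_features` is a hypothetical helper name, and while I would expect the CDS tally for a simple single-exon genome like this one to line up with `cds_nrow`, that correspondence is an assumption and may not hold for files with multi-part or duplicated CDS records. The demo runs on a made-up sample, not the real file.

```shell
#!/bin/sh
# count_features FILE
# Hypothetical helper: prints "type<TAB>count" for each feature type (column 3).
count_features() {
  awk -F'\t' '
    /^#/ { next }                 # skip directives and comments
    NF >= 9 { n[$3]++ }           # only well-formed nine-column records
    END { for (t in n) print t "\t" n[t] }
  ' "$1" | LC_ALL=C sort
}

# Demo on a small made-up sample: one region, one gene, one CDS.
sample=$(mktemp)
printf 'X\tGenbank\tregion\t1\t100\t.\t+\t.\tID=X:1..100\n' > "$sample"
printf 'X\tGenbank\tgene\t1\t9\t.\t+\t.\tID=gene-P1\n' >> "$sample"
printf 'X\tGenbank\tCDS\t1\t9\t.\t+\t0\tID=cds-P1;Parent=gene-P1\n' >> "$sample"
count_features "$sample"
rm -f "$sample"
```

On the real data this would be `count_features orf_re.gff3`; comparing it with `count_features orf.gff3` also shows at a glance which feature types the reformatting step added.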

Hope it helps. :-)


lokraj2003 replied (6 months ago):

Thank you! It worked.

SMK replied (6 months ago):

You're welcome! If an answer was helpful, you can upvote it; if it resolved your question, you can mark it as accepted.