Creating a transcript database (TxDb) from a GFF3 file?
3.1 years ago
lokraj2003 ▴ 90

I am trying to create a transcript database (TxDb) using the GenomicFeatures package. I downloaded the GFF3 file from NCBI.

Code:

orf <- GenomicFeatures::makeTxDbFromGFF("orf.gff3",format="auto")

I get the following output:

> orf
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: mouse.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 0
# exon_nrow: 0
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-29 22:32:09 -0500 (Wed, 29 May 2019)
# GenomicFeatures version at creation time: 1.32.2
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

Link to the genome: https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1

As you can see, there are no transcripts, exons, or CDS in this database. Can anyone help with this, please?

Bioconductor GenomeFeatures • 1.1k views
3.1 years ago
AK ★ 2.1k

Hi lokraj2003,

The GFF3 file that you downloaded from https://www.ncbi.nlm.nih.gov/nuccore/AY386263.1 appears to contain CDS features but no parent gene features, which would explain the empty TxDb. You can add gene features to the file, then reload it using the same function.

For example (any method of reformatting the original GFF3 will do; here I use awk to create a gene line for each CDS and add a Parent attribute to each CDS, and I leave the parsing of other details to you):

$ cat orf.gff3 \
  | awk 'BEGIN{FS=OFS="\t"} $3!="CDS"{print $0} $3=="CDS"{GENE=$0; gsub("\t0\t", "\t.\t", GENE); gsub("CDS", "gene", GENE); gsub("cds", "gene", GENE); gsub(";product=.*", "", GENE); print GENE; ID=$9; gsub(".*;protein_id=", "", ID); print $0 ";Parent=gene-" ID}' \
  > orf_re.gff3
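
To see what the reformatting changed, it can help to tally the feature types in column 3 of both files. Below is a minimal sketch, assuming a tab-delimited GFF3 with nine columns; the helper name `count_feature_types` is my own and not part of any tool.

```shell
# count_feature_types FILE
# Print each feature type (GFF3 column 3) with its number of occurrences,
# skipping directive/comment lines that start with '#'.
# (Hypothetical helper name, for illustration only.)
count_feature_types() {
    awk 'BEGIN { FS = "\t" }
         !/^#/ && NF >= 9 { n[$3]++ }
         END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

Running `count_feature_types orf.gff3` and then `count_feature_types orf_re.gff3` should show whether gene lines are present before and after the awk step.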

$ head orf_re.gff3
##sequence-region AY386263.1 1 137241
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=10258
AY386263.1  Genbank region  1   137241  .   +   .   ID=AY386263.1:1..137241;Dbxref=taxon:10258;country=USA: Iowa;gbkey=Src;genome=genomic;isolate=ORFA;isolation-source=nasal secretions of a lamb at the Iowa Ram Test Station during an outbreak in 1982%2C then passaged in ovine fetal turbinate cells;mol_type=genomic DNA;strain=OV-IA82
AY386263.1  Genbank gene    2409    2858    .   -   .   ID=gene-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=gene
AY386263.1  Genbank CDS 2409    2858    .   -   0   ID=cds-AAR98099.1;Dbxref=NCBI_GP:AAR98099.1;Name=AAR98099.1;gbkey=CDS;product=ORF001 hypothetical protein;protein_id=AAR98099.1;Parent=gene-AAR98099.1
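
If you want to confirm that every CDS now points at a gene line, something like the following could be used. This is only a sketch: the function name `check_cds_parents` is hypothetical, and it assumes each CDS has a single Parent value (GFF3 also allows comma-separated lists, which this does not handle).

```shell
# check_cds_parents FILE
# Return 0 if every CDS line has a Parent= attribute that names the ID of a
# gene line in the same file; otherwise print the offending CDS IDs and
# return 1. (Hypothetical helper name; assumes one Parent value per CDS.)
check_cds_parents() {
    awk 'BEGIN { FS = "\t" }
         /^#/ { next }
         # Pull one attribute value (e.g. ID or Parent) out of column 9.
         function attr(s, key,    i, n, kv) {
             n = split(s, kv, ";")
             for (i = 1; i <= n; i++) {
                 if (index(kv[i], key "=") == 1) {
                     return substr(kv[i], length(key) + 2)
                 }
             }
             return ""
         }
         $3 == "gene" { genes[attr($9, "ID")] = 1 }
         $3 == "CDS"  { parent[attr($9, "ID")] = attr($9, "Parent") }
         END {
             bad = 0
             for (id in parent) {
                 if (parent[id] == "" || !(parent[id] in genes)) {
                     print "orphan CDS: " id
                     bad = 1
                 }
             }
             exit bad
         }' "$1"
}
```

For example, `check_cds_parents orf_re.gff3 && echo "all CDS have a gene parent"` prints the message only when no orphan CDS is found.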

And using that gff3 you'll get:

> orf <- GenomicFeatures::makeTxDbFromGFF("orf_re.gff3", format = "auto")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
> orf
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: orf_re.gff3
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 130
# exon_nrow: 130
# cds_nrow: 130
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2019-05-30 16:57:03 +0200 (Thu, 30 May 2019)
# GenomicFeatures version at creation time: 1.34.8
# RSQLite version at creation time: 2.1.1
# DBSCHEMAVERSION: 1.2

Hope it helps. :-)

Thank you! It worked.

You're welcome! If an answer was helpful, you can upvote it; if it resolved your question, you can mark it as accepted.
