Question

Annotation of multiple reference genomes

0

3.6 years ago

nitinra ▴ 50

Hello all,

Project background: I am studying the evolution of chemosensory genes, specifically gustatory receptors (GRs) and odorant receptors (ORs), across insect orders. I plan to compare GR and OR diversity between herbivorous and non-herbivorous insect orders, and to examine selection and diversification rates along the branches separating them.

I have selected and downloaded about 180 genomes from NCBI Assembly (GenBank), most of which lack annotation. What is the best way to bulk-annotate these reference genomes so that I get the list of genes and proteins for each genome and can then extract the GRs and ORs from all of them?

I have been looking at BRAKER and eggNOG, but they appear to be designed for annotating novel genomes and might be slow for bulk annotation.

Thank you in advance!

reference phylogenomics annotation • 1.4k views

ADD COMMENT • link updated 3.6 years ago by GenoMax 154k • written 3.6 years ago by nitinra ▴ 50

1

I find it hard to believe that you downloaded many genomes from NCBI that do not have annotations. I think it is more likely that the annotations are there, but maybe you didn't look in the correct place. If you tell us a couple of genomes you downloaded and from where, we may be able to offer advice.

ADD REPLY • link 3.6 years ago by Mensur Dlakic ★ 30k

0

Hi Mensur,

Here are some of the genomes (accession numbers) I downloaded (from Genbank):

GCA_912999745.1
GCA_000836215.1
GCA_000836235.1
GCA_914767665.1
GCA_013731165.1
GCA_001687245.1
GCA_012932325.1
GCA_002926335.1
GCA_910589645.1
GCA_015345945.1

Out of the 180 genomes I downloaded, only 52 had the associated .gff and protein.faa files.
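A count like the one above can be reproduced with a short script. This is only a minimal sketch: it assumes a hypothetical layout with one sub-directory per accession under a common download folder, and it treats any file ending in `.gff`/`.gff.gz` or `.faa`/`.faa.gz` as the annotation and protein files. The function names are made up for illustration.

```python
from pathlib import Path


def annotation_status(root):
    """Map each accession directory under `root` to True if it holds
    both a GFF file and a protein FASTA file, else False.

    Assumed (hypothetical) layout: root/<accession>/<files>.
    """
    status = {}
    for acc_dir in sorted(Path(root).iterdir()):
        if not acc_dir.is_dir():
            continue
        names = [p.name for p in acc_dir.iterdir() if p.is_file()]
        has_gff = any(n.endswith((".gff", ".gff.gz")) for n in names)
        has_faa = any(n.endswith((".faa", ".faa.gz")) for n in names)
        status[acc_dir.name] = has_gff and has_faa
    return status


def unannotated(root):
    """Return the accessions that lack a GFF file, a protein FASTA, or both."""
    return [acc for acc, ok in annotation_status(root).items() if not ok]
```

Calling `unannotated("downloads")` (the folder name is a placeholder) would then give the list of accessions that still need annotation files.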

ADD REPLY • link 3.6 years ago by nitinra ▴ 50

1

Mensur Dlakic posted links to the RefSeq versions of the genomes, but the corresponding GenBank versions should be available at similar links: replace GCF with GCA.

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/012/932/325/GCA_012932325.1_TpBJ-2018v1/
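The directory layout in the link above can be built programmatically, which may help when checking 180 accessions. The following is a rough sketch, not an official NCBI client: the function names are hypothetical, the assembly name (e.g. `TpBJ-2018v1`) has to be supplied exactly as it appears in the FTP directory name, and the annotation files only exist if annotation was actually deposited for that assembly. Note also that a RefSeq (GCF) accession and its GenBank (GCA) counterpart may carry different version numbers, so a plain prefix swap is a starting point to verify rather than a guarantee.

```python
import re

FTP_ROOT = "https://ftp.ncbi.nlm.nih.gov/genomes/all"


def assembly_dir_url(accession, assembly_name):
    """Build the FTP directory URL for an assembly.

    The nine accession digits are split into three 3-digit path
    components, following the pattern visible in the links above.
    """
    m = re.fullmatch(r"(GC[AF])_(\d{9})\.\d+", accession)
    if m is None:
        raise ValueError(f"unexpected accession format: {accession}")
    prefix, digits = m.groups()
    chunks = [digits[i:i + 3] for i in range(0, 9, 3)]
    dirname = f"{accession}_{assembly_name}"
    return "/".join([FTP_ROOT, prefix, *chunks, dirname]) + "/"


def annotation_file_urls(accession, assembly_name):
    """Candidate URLs for the GFF, protein FASTA and GenBank flat files.

    These follow the usual <directory name>_<suffix> naming; each file
    is present only if it was generated for that assembly.
    """
    base = assembly_dir_url(accession, assembly_name)
    stem = f"{accession}_{assembly_name}"
    suffixes = ("genomic.gff.gz", "protein.faa.gz", "genomic.gbff.gz")
    return [f"{base}{stem}_{s}" for s in suffixes]


def refseq_to_genbank(accession):
    """Swap a GCF_ prefix for GCA_ (the version number may still differ)."""
    return re.sub(r"^GCF_", "GCA_", accession)
```

For example, `assembly_dir_url("GCA_012932325.1", "TpBJ-2018v1")` reproduces the link posted above.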

Some genomes may have a GenBank version but no RefSeq version, e.g. https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/926/335/GCA_002926335.1_tcristinae_2.1/ This genome seems to have only the GenBank flat file version available (no GFF).

https://github.com/jorvis/biocode/blob/master/gff/convert_genbank_to_gff3.py purportedly converts GBFF to GFF, but you will need to verify that claim.

ADD REPLY • link 3.6 years ago by GenoMax 154k

score 2 · Accepted Answer · 2022-04-08

This is how I would find the files, and it works for at least two of the examples you listed above (I tried only two). Go to the NCBI main site and enter your accession numbers into the search box. When you get the search results, scroll down to the "Assembly" link and open it. On the far right side of the assembly page, select access to either the RefSeq (if available) or the GenBank FTP files. When I did this for two of your accession numbers, this is what showed up, and both .gff and .faa files are available.