Any way to retrieve the annotation information of genomes under a BioProject? (NCBI)
19 months ago

Hello! I want to retrieve the annotation data (tRNAs, rRNA genes) from genomes located under this bioproject https://www.ncbi.nlm.nih.gov/bioproject/729490

There are 677 genomes, and by clicking on the "Genome-Annotation-Data" of each genome entry (e.g. GCA_029245675.1) I can see the number of tRNAs and rRNA genes. I need a table that reports the number of these genes for each genome in this bioproject. Is there a way to do this, for example with E-utilities? I have been looking for commands to do this but haven't found anything relevant.

Thanks for your time.

NCBI PGAP Bioproject annotation • 992 views
19 months ago
GenoMax 147k
$ esearch -db assembly -query GCA_029245675 | esummary | xtract -pattern DocumentSummary -element FtpPath_GenBank
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1

Find the feature table file in this directory: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1/GCA_029245675.1_ASM2924567v1_feature_table.txt.gz

That file has the detailed info you need.

If you just need a summary, get the feature count file instead: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1/GCA_029245675.1_ASM2924567v1_feature_count.txt.gz

$ more GCA_029245675.1_ASM2924567v1_feature_count.txt
# Feature       Class   Full Assembly   Assembly-unit accession Assembly-unit name      Unique Ids      Placements
CDS     with_protein    GCA_029245675.1 GCA_029245695.1 Primary Assembly        1824    1824
CDS     without_protein GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      7
gene    RNase_P_RNA     GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
gene    SRP_RNA GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
gene    protein_coding  GCA_029245675.1 GCA_029245695.1 Primary Assembly        1824    1824
gene    pseudogene      GCA_029245675.1 GCA_029245695.1 Primary Assembly        7       7
gene    rRNA    GCA_029245675.1 GCA_029245695.1 Primary Assembly        3       3
gene    tRNA    GCA_029245675.1 GCA_029245695.1 Primary Assembly        49      49
gene    tmRNA   GCA_029245675.1 GCA_029245695.1 Primary Assembly        1       1
ncRNA   RNase_P_RNA     GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
ncRNA   SRP_RNA GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
rRNA            GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      3
tRNA            GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      49
tmRNA           GCA_029245675.1 GCA_029245695.1 Primary Assembly        na      1
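
To scale this to every genome in the bioproject, something along the following lines might work. It is only a sketch under a few assumptions: the feature count file is tab-delimited with "Unique Ids" as the sixth column (as the listing above suggests), its URL is the FtpPath_GenBank value plus the directory name and `_feature_count.txt.gz`, and the bioproject-to-assembly `elink` step returns the assemblies you expect. The function names (`feature_count_url`, `summarize_counts`, `bioproject_table`) are made up for illustration, and the `elink` step has not been checked against this bioproject.

```shell
# Build the feature-count URL from an FtpPath_GenBank value.
# Assumption: file name = last path component + "_feature_count.txt.gz".
feature_count_url() (
    ftp_path="${1%/}"                       # drop a trailing slash, if any
    base="${ftp_path##*/}"                  # e.g. GCA_029245675.1_ASM2924567v1
    case "$ftp_path" in
        ftp://*) ftp_path="https://${ftp_path#ftp://}" ;;   # use the https mirror
    esac
    printf '%s/%s_feature_count.txt.gz\n' "$ftp_path" "$base"
)

# Read an uncompressed feature-count table on stdin and print one
# tab-separated row: assembly accession, rRNA genes, tRNA genes.
# Uses the "gene" rows and the "Unique Ids" column (assumed to be field 6).
summarize_counts() {
    awk -F'\t' '
        !/^#/ && acc == ""            { acc = $3 }
        $1 == "gene" && $2 == "rRNA"  { rrna = $6 }
        $1 == "gene" && $2 == "tRNA"  { trna = $6 }
        END { if (acc != "") printf "%s\t%d\t%d\n", acc, rrna, trna }
    '
}

# Hypothetical driver: one row per assembly linked to a BioProject.
# Needs EDirect, curl and gunzip; the elink step is an assumption.
bioproject_table() (
    printf 'assembly\trRNA_genes\ttRNA_genes\n'
    esearch -db bioproject -query "$1" | elink -target assembly | esummary |
        xtract -pattern DocumentSummary -element FtpPath_GenBank |
        while read -r path; do
            [ -n "$path" ] || continue      # skip records without a GenBank path
            curl -fsS "$(feature_count_url "$path")" | gunzip -c | summarize_counts
        done
)
```

After sourcing these functions, `bioproject_table PRJNA729490 > counts.tsv` would be the intended call. A genome whose table has no rRNA or tRNA gene rows is reported with 0 for that column.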

So, as I understand it, that command outputs the URL where the required info is stored, is that right? The feature count table is what I need for each of the genomes under that bioproject. How did you retrieve the link to the feature count table for that genome? (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/029/245/675/GCA_029245675.1_ASM2924567v1/GCA_029245675.1_ASM2924567v1_feature_count.txt.gz)

Thanks so much for your reply :)

19 months ago
vkkodali_ncbi ★ 3.8k

You can use NCBI Datasets to search for genomes by a bioproject and download data directly. In this instance, you can download the annotation data in GFF3 format as follows:

datasets download genome accession PRJNA729490 --annotated --include gff3

Once you have the GFF3 files, you can parse them to extract the information you need, including the counts of different feature types included in the annotation.
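
As a rough illustration of that parsing step, the sketch below tallies the feature types in column 3 of a single GFF3 file. The function name `count_gff3_features` is hypothetical, and the code assumes standard nine-column, tab-delimited GFF3.

```shell
# Count feature types (GFF3 column 3) in one file.
# Skips comment/directive lines and stops at an embedded ##FASTA section.
count_gff3_features() {
    awk -F'\t' '
        /^##FASTA/      { exit }          # sequence data follows; END still runs
        /^#/ || NF < 9  { next }          # comments, directives, malformed lines
                        { n[$3]++ }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | LC_ALL=C sort
}
```

Assuming the downloaded archive unpacks to one `ncbi_dataset/data/<accession>/genomic.gff` per genome (worth checking against your own download), a loop such as `for f in ncbi_dataset/data/*/genomic.gff; do count_gff3_features "$f"; done` would print the counts genome by genome. The `rRNA` and `tRNA` rows are the ones of interest here.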
