Question: Building a Metatranscriptome database from existing databases
4.5 years ago by
United States
jeremy.cox.290 wrote:

Hello All,

Most of the identified bacterial genes are hypothetical.  I want to build a database of any and all bacterial genes so I can do an RNA-seq simulation of a microbiome.

However, I have yet to find a pre-built database like this.  The RefSeq database does include several RNA.fna files, but because RefSeq is curated, those sequences are only a small subset of what is available.

Can anyone advise on the best way to query a repository like GenBank for something like "all prokaryote, nucleotide, protein-encoding" sequences?

Any solution is good, even an inelegant one, such as downloading and parsing all the database files myself.  I could really use guidance on how to approach the problem.

4.5 years ago by
Josh Herr5.7k
University of Nebraska
Josh Herr5.7k wrote:

I'm not sure of your rationale for creating your own RNA-seq simulation when there are numerous mock and real metatranscriptome data sets for many environments, especially for the human microbiome (you didn't clarify which microbiome you mean, and the expressed reads would differ from one microbiome to another).

If you're looking for a master list of all bacterial coding regions, I would just pull all the bacterial genomes from NCBI using their FTP server and select all the protein-encoding genes from the genomes using the GFF files.  This shouldn't take long; downloading all the data would probably be the most time-consuming part.  Selectively downloading just the coding regions would be the best use of your time, though I'm not sure whether NCBI has a list of files.  Searching FTP is troublesome.
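
For illustration, here is a minimal sketch of the "select the protein-encoding genes using the GFF files" step, assuming you already have a genome FASTA and its matching GFF3 annotation on hand. The function names (read_fasta, extract_cds) and the toy sequence are made up for this example and are not part of any NCBI tool. It treats each CDS row independently, which is usually fine for bacteria but ignores multi-segment features.

```python
import io

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(handle):
    """Return {sequence_id: sequence} from an open FASTA handle."""
    seqs, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Keep only the first word of the header as the sequence ID.
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only; other letters pass through)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_cds(gff_handle, genome):
    """Yield (label, sequence) for every CDS row of a GFF3 file."""
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section ends the annotation rows
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        if seqid not in genome:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = reverse_complement(seq)
        yield f"{seqid}:{start}-{end}({strand})", seq


# Toy data standing in for one downloaded genome and its annotation.
fasta_text = ">chr1 toy genome\nAAATGCCCTAAGG\nTTACATTT\n"
gff_text = (
    "##gff-version 3\n"
    "chr1\ttoy\tgene\t3\t11\t.\t+\t.\tID=gene1\n"
    "chr1\ttoy\tCDS\t3\t11\t.\t+\t0\tID=cds1\n"
    "chr1\ttoy\tCDS\t14\t19\t.\t-\t0\tID=cds2\n"
)

genome = read_fasta(io.StringIO(fasta_text))
records = list(extract_cds(io.StringIO(gff_text), genome))
for label, seq in records:
    print(f">{label}\n{seq}")
```

For real data you would replace the io.StringIO objects with open file handles and loop over every genome/annotation pair you downloaded, writing the records to one combined FASTA file.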


Thanks!  This is exactly what I needed.  NCBI had it all along.
I also found a suggestion here (Where Can I Download Nucleotide Sequences Of Bacterial Genes? ): the .ffn files contain only gene coding sequences. 
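
In case it helps someone else, here is a rough sketch of how the downloaded .ffn files could be combined into one database. It assumes all the files sit in one local directory (the file names below are invented for the demo) and prefixes each header with the file name, since gene IDs may repeat between genomes. merge_fasta_files is a made-up helper name, not an existing tool.

```python
import tempfile
from pathlib import Path


def merge_fasta_files(paths, out_path):
    """Concatenate FASTA files into out_path, prefixing each header
    with the source file's stem so record names stay unique.
    Returns the number of records written."""
    inputs = sorted(Path(p) for p in paths)
    n_records = 0
    with open(out_path, "w") as out:
        for path in inputs:
            with open(path) as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        out.write(f">{path.stem}|{line[1:]}\n")
                        n_records += 1
                    elif line.strip():
                        out.write(line + "\n")
    return n_records


# Demo on two tiny made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "genomeA.ffn").write_text(">gene1\nATGAAATAA\n")
    (tmp / "genomeB.ffn").write_text(">gene1\nATGCCCTGA\n>gene2\nATGTAA")
    count = merge_fasta_files(tmp.glob("*.ffn"), tmp / "all_cds.fasta")
    merged_text = (tmp / "all_cds.fasta").read_text()

print(count)
print(merged_text, end="")
```

The combined file can then be used as the reference set of transcripts for whichever read simulator you choose.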

written 4.5 years ago by jeremy.cox.290