Question: Building a Metatranscriptome database from existing databases
4.1 years ago by
United States
jeremy.cox.290 wrote:

Hello All,

Most of the identified bacterial genes are hypothetical.  I want to build a database of any and all bacterial genes so I can do an RNA-seq simulation of a microbiome.

However, I have yet to find a pre-built database like this.  The RefSeq database does include several RNA.fna files, but because RefSeq is curated, those sequences are only a small subset of what is available.

Can anyone advise on the best way to query a repository like GenBank for something like "all prokaryote, nucleotide, protein-encoding" sequences?

Any solution is good, even an inelegant one, such as downloading and parsing through all the database files myself.  I could really use guidance on how to approach the problem.

4.1 years ago by
University of Nebraska
Josh Herr wrote:

I'm not sure of your rationale for creating your own RNA-seq simulation when there are numerous mock and real metatranscriptome data sets for many environments, especially for the human microbiome (you didn't say which one you're after -- the expressed reads would differ between microbiomes).

If you're looking for a master list of all bacterial coding regions, I would just pull all the bacterial genomes from NCBI using their FTP server and select all the protein-coding genes from the genomes using the GFF files.  This shouldn't take long; downloading all the data will probably be the most time-consuming part.  Selectively downloading just the coding regions would be the best use of your time -- I'm not sure if NCBI has a list of files.  Searching the FTP site is troublesome.
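
For what it's worth, the GFF-based selection could look roughly like the sketch below. It assumes you already have a genome FASTA and a matching GFF3 file on disk, that CDS features are tab-delimited with 1-based inclusive coordinates, and that each CDS line can be treated on its own (usually fine for bacteria, but not checked here). The function names read_fasta and extract_cds are made up for this example; they are not part of any NCBI tool.

```python
# Complement table for reverse-complementing minus-strand features.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_fasta(handle):
    """Return {sequence_id: sequence} from an open FASTA file."""
    records, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            # Keep only the first word of the header as the ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def extract_cds(genome, gff_handle):
    """Yield (label, sequence) for every CDS line in a GFF3 file.

    genome is the dict returned by read_fasta. Features on contigs
    missing from genome are skipped silently.
    """
    for line in gff_handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, strand = cols[0], cols[6]
        start, end = int(cols[3]), int(cols[4])
        if seqid not in genome:
            continue
        # GFF coordinates are 1-based and inclusive.
        seq = genome[seqid][start - 1:end]
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        label = attrs.get("ID", f"{seqid}:{start}-{end}({strand})")
        yield label, seq
```

You would open the two files, pass the FASTA handle to read_fasta, and write each (label, sequence) pair from extract_cds out as a FASTA record. Repeat per genome and concatenate the results.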

Thanks!  This is exactly what I needed.  NCBI had it all along.
I also found a suggestion here (Where Can I Download Nucleotide Sequences Of Bacterial Genes?): the .ffn files contain only the gene coding sequences.
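
In case it helps anyone else, here is a rough sketch of pooling the .ffn files into one FASTA file for the simulation. It assumes the .ffn files have already been downloaded into a local directory tree; the function name merge_ffn and the "file stem|" header prefix are my own choices for illustration, not an NCBI convention.

```python
from pathlib import Path


def merge_ffn(ffn_dir, out_path):
    """Concatenate every .ffn file under ffn_dir into one FASTA file.

    Each header is prefixed with the source file's stem so records
    stay traceable and IDs from different genomes do not collide.
    Returns the number of sequences written.
    """
    count = 0
    with open(out_path, "w") as out:
        # sorted() makes the output order reproducible between runs.
        for path in sorted(Path(ffn_dir).rglob("*.ffn")):
            with open(path) as handle:
                for line in handle:
                    text = line.strip()
                    if not text:
                        continue
                    if text.startswith(">"):
                        out.write(f">{path.stem}|{text[1:]}\n")
                        count += 1
                    else:
                        out.write(text + "\n")
    return count
```

Keep the output file outside ffn_dir (or give it an extension other than .ffn) so it is not picked up as an input on a later run.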

