Building a Metatranscriptome database from existing databases
8.9 years ago
jeremy.cox.2 ▴ 130

Hello All,

Most of the identified bacterial genes are hypothetical. I want to build a database of any and all bacterial genes so I can do an RNA-seq simulation of a microbiome.

However, I have yet to find a pre-built database like this. The RefSeq database does include several RNA.fna files, but because RefSeq is curated, those sequences are only a small subset of what is available.

Can anyone advise on the best way to query a repository like GenBank for something like "all prokaryote, nucleotide, protein encoding" sequences?

Any solution is good, even an inelegant one, such as downloading and parsing through all the database files myself. I could really use guidance on how to approach the problem.

metagenomics RNA-Seq metatranscriptomics • 2.0k views
8.9 years ago
Josh Herr 5.8k

I'm not sure of your rationale for creating your own RNA-seq simulation when there are numerous mock and real metatranscriptome datasets for many environments, especially for the human microbiome. (You didn't say which microbiome you mean, and the expressed reads would differ between microbiomes.)

If you're looking for a master list of all bacterial coding regions, I would pull all the bacterial genomes from NCBI's FTP server and extract the protein-encoding genes from the genomes using the GFF files. The extraction shouldn't take long; downloading all the data will probably be the most time-consuming part. Downloading only the coding regions would be a better use of your time, but I'm not sure whether NCBI has a list of such files, and searching the FTP site is troublesome.

Thanks! This is exactly what I needed. NCBI had it all along. I also found a suggestion elsewhere: the .ffn files contain only gene coding sequences.
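If the per-genome .ffn files are downloaded locally, they could be combined into a single reference for the simulation along these lines. This is only a sketch: `merge_fasta` is a hypothetical helper, and prefixing each header with the source file name is an assumption made here to keep sequence IDs unique across genomes.

```python
from pathlib import Path


def merge_fasta(paths, out_path):
    """Concatenate FASTA files into out_path; return the number of records.

    Each header is prefixed with the source file's stem so that records
    from different genomes cannot collide on ID.
    """
    count = 0
    with open(out_path, "w") as out:
        for path in paths:
            stem = Path(path).stem
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        out.write(f">{stem}|{line[1:].rstrip()}\n")
                        count += 1
                    elif line.strip():
                        out.write(line.rstrip() + "\n")
    return count
```

For example, `merge_fasta(sorted(Path("downloads").glob("*.ffn")), "all_cds.fna")` would build one file from a directory of downloads; both paths are placeholders.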
