How to download all available sequences of a gene from all bacteria using R


5.3 years ago

mschmidt ▴ 80

I need to download all (or many) sequences of a specific bacterial gene from the GenBank nuccore database, restricted to entries that are complete genome sequences, and I would prefer to do this in R. Querying 'Bacteria[ORGN] AND gyrB[GENE] AND complete genome[TI]' in the web interface returns >10k hits. I do not want to download whole genome sequences, only the extracted gyrB sequences, to build a local database. I tried:

library(rentrez)
db <- "nuccore"
query <- "Bacteria[ORGN] AND gyrB[GENE] AND complete[TI]"
# retmax caps how many record IDs the search returns
found <- entrez_search(db = db, term = query, config = NULL, retmode = "xml", use_history = FALSE, retmax = 90000)

but this returns IDs for whole genome sequences. Is it possible to get FASTA sequences for the gyrB genes, or at least the gyrB coordinates? I would rather not download the whole genome sequences of thousands of genomes.

R sequence gene genbank • 1.5k views

updated 5.1 years ago by Biostar 20 • written 5.3 years ago by mschmidt ▴ 80
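One possible route to the coordinates asked about above is to request each record's feature table instead of its sequence, locate the gyrB gene feature, and then fetch only that sub-range. The sketch below is not a tested pipeline. It assumes the records carry a "gene" feature with a /gene qualifier equal to gyrB (many genomes annotate only a locus_tag or product), that the feature is a single interval, and that rentrez forwards the E-utilities parameters seq_start, seq_stop and strand to efetch. The helper names find_gene_coords and fetch_gene_fasta are made up for illustration.

```r
# Parse an NCBI 5-column feature table (as text) and return the coordinates of
# "gene" features whose /gene qualifier equals `gene_name`.
# Returns NULL when nothing matches.
find_gene_coords <- function(ft_text, gene_name = "gyrB") {
  lines <- strsplit(ft_text, "\n", fixed = TRUE)[[1]]
  hits <- list()
  current <- NULL
  for (ln in lines) {
    f <- strsplit(sub("\r$", "", ln), "\t", fixed = TRUE)[[1]]
    if (length(f) >= 3 && nzchar(f[1]) && nzchar(f[3])) {
      # Feature line: start <TAB> stop <TAB> feature key
      current <- list(start = f[1], stop = f[2], key = f[3])
    } else if (length(f) >= 5 && !is.null(current) &&
               current$key == "gene" && f[4] == "gene" && f[5] == gene_name) {
      # Qualifier line: three empty columns, then qualifier key and value.
      # "<" and ">" mark partial features and are stripped here.
      a <- as.integer(gsub("[<>]", "", current$start))
      b <- as.integer(gsub("[<>]", "", current$stop))
      # In this format start > stop means the feature is on the minus strand;
      # efetch expects strand = 1 (plus) or 2 (minus).
      hits[[length(hits) + 1]] <- data.frame(
        start = min(a, b), stop = max(a, b),
        strand = if (a <= b) 1L else 2L
      )
    }
  }
  do.call(rbind, hits)
}

# Fetch only the gene's sub-sequence as FASTA for one nuccore ID.
# Needs the rentrez package and network access; makes two requests per record.
fetch_gene_fasta <- function(id, gene_name = "gyrB") {
  ft <- rentrez::entrez_fetch(db = "nuccore", id = id,
                              rettype = "ft", retmode = "text")
  coords <- find_gene_coords(ft, gene_name)
  if (is.null(coords)) return(NULL)
  rentrez::entrez_fetch(db = "nuccore", id = id,
                        rettype = "fasta", retmode = "text",
                        seq_start = as.character(coords$start[1]),
                        seq_stop  = as.character(coords$stop[1]),
                        strand    = as.character(coords$strand[1]))
}
```

With the search result from the question, something like `seqs <- lapply(found$ids, fetch_gene_fasta)` would loop over the records. For >10k genomes that is many requests, so NCBI's rate limits (3 requests per second without an API key) would need to be respected, for example with `Sys.sleep()` between calls.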


You can get this data from Ensembl Bacteria using the Ensembl Genomes Perl API, or possibly with the R package biomartr.

5.3 years ago by Jean-Karim Heriche 27k


That would be a great option, but I found that BioMart is not currently available for Ensembl Bacteria: https://support.bioconductor.org/p/82585/

5.3 years ago by mschmidt ▴ 80
