Question

Download multiple bacterial CDS from NCBI using Biopython

4.5 years ago

lisssanka • 0

Hi there, I'm new to Biopython and I've stalled with the following task.

I need to get a number of coding sequences from different bacterial genomes. The idea is to have a fairly large dataset of different coding sequences: at least 7000 of them.

I don't care about the exact species, but they must not be closely related to each other. So I think I need to take a number of bacterial genomes (one species per genus, each from a different genus) and somehow retrieve all of their CDSs. I may then also need to filter out homologs.

I tried using Biopython's Entrez.esearch() and Entrez.efetch() functions, but I can't find a good way to retrieve my data under the conditions above. I'd appreciate any help and suggestions!

CDS Biopython NCBI • 995 views

ADD COMMENT • link 4.5 years ago by lisssanka • 0

You could perhaps try downloading representative sequences, and, depending on how many there are, maybe take a random selection of accessions as a surrogate.
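As a rough sketch of that idea in Biopython: draw a random subset of nucleotide accessions, then ask EFetch for the annotated CDS of each one. The function names, the `pause` value and the placeholder e-mail are my own choices for illustration, and I'm assuming the `fasta_cds_na` return type of the `nuccore` database, which to my knowledge returns the CDS features as nucleotide FASTA. Treat it as a starting point, not a tested pipeline.

```python
import random
import time


def sample_accessions(accessions, k, seed=0):
    """Return k accessions drawn at random without replacement (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(list(accessions), k)


def fetch_cds(accessions, email, pause=0.4):
    """Yield (accession, SeqRecord) for each CDS annotated on each nucleotide record."""
    # Imported here so that sample_accessions() can be used without Biopython installed.
    from Bio import Entrez, SeqIO

    Entrez.email = email  # NCBI asks for a contact address with every request
    for acc in accessions:
        # Assumption: rettype="fasta_cds_na" returns the annotated CDS as nucleotide FASTA.
        handle = Entrez.efetch(
            db="nuccore", id=acc, rettype="fasta_cds_na", retmode="text"
        )
        try:
            for record in SeqIO.parse(handle, "fasta"):
                yield acc, record
        finally:
            handle.close()
        time.sleep(pause)  # be gentle with NCBI's request rate limits
```

You would call it along the lines of `for acc, rec in fetch_cds(sample_accessions(my_accessions, 5), "you@example.org"): ...`, where `my_accessions` is whatever list of genome accessions you have collected.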

7000 coding sequences isn't very many for bacterial genomes. A fairly typical enterobacterium like E. coli has between 4,000 and 5,000 on its own.

To my knowledge, there's no very easy way to get a single representative from each genus. This is a question that has come up for another excellent tool, ncbi-genome-download, and there isn't a simple answer other than taking a genome at random AFAICT.
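Taking a genome at random per genus could look something like the sketch below. It assumes you already have (accession, organism name) pairs from somewhere, for instance a downloaded assembly listing, and it treats the first word of the organism name as the genus, which is a crude heuristic that will misfile names starting with "Candidatus" or "uncultured". The function name and the accessions in the example are made up.

```python
import random
from collections import defaultdict


def one_genome_per_genus(rows, seed=0):
    """Pick one accession at random for each genus.

    rows: iterable of (accession, organism_name) pairs.
    Returns a dict mapping genus -> chosen accession.
    """
    by_genus = defaultdict(list)
    for accession, organism_name in rows:
        parts = organism_name.split()
        if not parts:
            continue  # skip rows with an empty organism name
        # Crude heuristic: the first word of the name is taken as the genus.
        by_genus[parts[0]].append(accession)

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {genus: rng.choice(accs) for genus, accs in sorted(by_genus.items())}


# Placeholder accessions, for illustration only
rows = [
    ("GCF_1", "Escherichia coli K-12"),
    ("GCF_2", "Escherichia albertii"),
    ("GCF_3", "Bacillus subtilis 168"),
]
picked = one_genome_per_genus(rows)
```

Here `picked` has one entry for Escherichia (either of its two genomes) and one for Bacillus.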

A 'rigorous' approach might be to download a large selection of genomes, and calculate a whole load of pairwise mash distances to figure out the most distantly related ones, then select those to 'harvest' CDSs from. This would probably take a good while and quite a bit of disk space/memory though.
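For the selection step, one simple option is a greedy "farthest point" pick: start from the most distant pair, then keep adding the genome whose nearest already-chosen neighbour is farthest away. This is only a heuristic and not guaranteed to give the optimal set. The sketch assumes you have already parsed the pairwise distances into a dict keyed by genome pairs; the function name and the toy distances are invented for illustration.

```python
def pick_distant(names, dist, k):
    """Greedily choose up to k mutually distant genomes.

    names: list of genome identifiers.
    dist:  dict mapping (a, b) -> distance; each pair may be stored in either order.
    """
    def d(a, b):
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    if k <= 1 or len(names) < 2:
        return list(names[:k])

    # Seed the selection with the most distant pair overall.
    pairs = ((a, b) for i, a in enumerate(names) for b in names[i + 1:])
    chosen = list(max(pairs, key=lambda p: d(*p)))

    # Repeatedly add the genome farthest from its closest chosen neighbour.
    while len(chosen) < min(k, len(names)):
        rest = [n for n in names if n not in chosen]
        chosen.append(max(rest, key=lambda n: min(d(n, c) for c in chosen)))
    return chosen


# Toy distances, for illustration only
names = ["A", "B", "C", "D"]
dist = {
    ("A", "B"): 0.01, ("A", "C"): 0.30, ("A", "D"): 0.25,
    ("B", "C"): 0.29, ("B", "D"): 0.26, ("C", "D"): 0.05,
}
selected = pick_distant(names, dist, 3)
```

With these toy values A and C are picked first (distance 0.30), then D, because B is almost identical to A. Note that computing all pairs scales quadratically with the number of genomes, which is where the time and memory cost comes from.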

ADD REPLY • link 4.5 years ago by Joe 21k

You can try editing my script, here: A: How to download all sequences of a list of proteins for a particular organism

ADD REPLY • link 4.5 years ago by Kevin Blighe 87k