Hi there, I'm new to Biopython and I've stalled with the following task.
I need to get a number of coding sequences from different bacterial genomes. The idea is to have a fairly large dataset of diverse coding sequences: at least 7000.
I don't care about the exact species, but they must not be closely related to each other. So I think I need to take a number of bacterial genomes (each from one species of a different genus) and somehow retrieve all of their CDSs. I may then also need to filter out homologs.
I tried to use Biopython's Entrez.esearch() and Entrez.efetch() functions, but I can't find a good way to retrieve my data under the conditions above. I'd appreciate any help and suggestions!
You could perhaps try downloading representative sequences and, depending on how many there are, take a random selection of accessions as a surrogate.
7000 coding sequences isn't very many for bacterial genomes. A fairly typical enterobacterium like E. coli has between 4,000 and 5,000 off the bat.
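If it helps, here is a rough sketch of how the random selection could be done while keeping one genome per genus, as the question asks. It assumes you already have a list of (accession, organism name) pairs from wherever you got your genome list; the function name `pick_one_per_genus` and the accessions in the usage note are made up for illustration. It also assumes the genus is simply the first word of the organism name (skipping "Candidatus"), which won't hold for every entry.

```python
import random
from collections import defaultdict


def pick_one_per_genus(genomes, n_genera=None, seed=None):
    """Pick one random accession per genus.

    genomes  : iterable of (accession, organism_name) pairs
    n_genera : optionally keep only this many randomly chosen genera
    seed     : set for a reproducible selection
    """
    rng = random.Random(seed)
    by_genus = defaultdict(list)
    for accession, organism in genomes:
        words = organism.split()
        if not words:
            continue
        # Assumption: the genus is the first word of the organism name,
        # or the second word when the name starts with "Candidatus".
        if words[0] == "Candidatus" and len(words) > 1:
            genus = words[1]
        else:
            genus = words[0]
        by_genus[genus].append(accession)

    genera = sorted(by_genus)
    if n_genera is not None and n_genera < len(genera):
        genera = rng.sample(genera, n_genera)

    # One randomly chosen accession for each retained genus.
    return {genus: rng.choice(by_genus[genus]) for genus in sorted(genera)}
```

For example, `pick_one_per_genus([("ACC_1", "Escherichia coli"), ("ACC_2", "Escherichia fergusonii"), ("ACC_3", "Bacillus subtilis")], seed=1)` returns a dict with one accession for Escherichia and one for Bacillus. At roughly 4,000 to 5,000 CDSs per genome, even two or three genera would clear the 7000 mark before any homolog filtering.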
To my knowledge, there's no very easy way to get a single representative from each genus. This is a question that has come up for another excellent tool, ncbi-genome-download, and there isn't a simple answer other than taking a genome at random, AFAICT.
A 'rigorous' approach might be to download a large selection of genomes and calculate a whole load of pairwise mash distances to figure out the most distantly related ones, then select those to 'harvest' CDSs from. This would probably take a good while and quite a bit of disk space/memory, though.
You can try editing my script, here: A: How to download all sequences of a list of proteins for a particular organism
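For the pairwise mash distance idea mentioned above, the selection step could look something like the sketch below. It's only one possible approach (a greedy "farthest point" pick, not anything mash itself provides), and it assumes the distances come in as tab-separated lines whose first three columns are the two genome names and the distance. The function names `parse_mash_dist` and `pick_distant` are made up for this example.

```python
import itertools


def parse_mash_dist(lines):
    """Read tab-separated lines into a {frozenset({a, b}): distance} dict.

    Assumption: columns are genome A, genome B, distance, then anything else.
    """
    dist = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        dist[frozenset((fields[0], fields[1]))] = float(fields[2])
    return dist


def pick_distant(names, dist, k):
    """Greedily pick k genomes that are far apart from each other.

    Starts from the most distant pair, then repeatedly adds the genome whose
    smallest distance to the already chosen ones is largest.
    Raises KeyError if a needed pair is missing from dist.
    """
    names = sorted(names)
    if k <= 0 or not names:
        return []
    if k == 1 or len(names) == 1:
        return names[:1]

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    chosen = list(max(itertools.combinations(names, 2), key=lambda p: d(*p)))
    while len(chosen) < min(k, len(names)):
        rest = [n for n in names if n not in chosen]
        chosen.append(max(rest, key=lambda n: min(d(n, c) for c in chosen)))
    return chosen
```

This scans all pairs, so it inherits the cost warning above: with thousands of genomes the distance table itself is the bottleneck, not the selection.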