8.1 years ago
pawlowac ▴ 80
I want to analyse the genetic context in which my gene is found across hundreds of genomes. For each genome I have the sequence 5 kb upstream and downstream of my gene. I have tried Mauve, but it doesn't seem to handle this many sequences at once.
My thought process is as follows:
1. Identify conserved fragments of DNA (coding or non-coding) within the sequence
2. Group sequences together that have those same fragments
3. Use Mauve to analyze a smaller number of more similar sequences
I'm not quite sure how to tackle steps 1 and 2. Using a global-alignment program (MAFFT) doesn't work here because I run out of memory (I have 8 GB of RAM). Does anyone have a suggestion?
How about identifying all RefSeq genomes that have the same gene and retrieving the annotations within ±5 kb in those genomes? This wouldn't be computationally demanding and would probably be relatively easy to achieve, e.g. with BLAST against refseq_genomic followed by some Entrez Direct magic.
I've used an ebot (efetch) Perl script to download all genomes associated with my protein GI numbers. Then, using Biopython, I've been able to extract annotations for ±5 kb around my gene of interest. Do you have a suggestion for comparing the sequences automatically?
What do you hope to achieve from comparing the sequences that you did not find out from comparing the annotations?
I hope to identify potential sites of recombination, to compare the sequence identity of the surrounding genes, and to compare the average mutation rate of the region surrounding the target genes with the average mutation rate of the target genes themselves.
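For the last of those goals, a crude proxy is the mean pairwise proportion of differing sites (p-distance), computed separately for the gene alignment and the flanking-region alignment. The sketch below is a minimal illustration, not a proper mutation-rate estimate: it assumes the sequences in each set have already been aligned to equal length, it skips any column containing a gap or ambiguous base, and the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence

_BASES = set("ACGT")


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap or ambiguous base are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in _BASES or y not in _BASES:
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def mean_pairwise_distance(aligned: Sequence[str]) -> float:
    """Mean p-distance over all pairs in an alignment (needs >= 2 rows)."""
    pairs = list(combinations(aligned, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)
```

Running `mean_pairwise_distance` once on the aligned target genes and once on the aligned flanks gives two numbers that can be compared directly. The p-distance underestimates divergence when sequences are distant, so a substitution model would be needed beyond closely related genomes.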
You don't say whether you are working at the population level, comparing closely related species, or comparing a wider range of taxa.