Hi everyone, I have been given the task of comparing one gene sequence among 50 strains of E. coli. For this study I have 50 genome scaffold files and one gene sequence file. My job is to compare the gene sequence across all the genomes and compute a phylogenetic tree of the gene from all the sequences. If anybody could point me in the right direction, I would be thankful!
As far as I understand your question, to build that tree you'll need to extract 50 sequences, one from each file representing a strain genome. Once you have these 50, you will need to align them all with ClustalW and then make a tree. To get the 50 sequences you can use BLAST: in this case I would build a database for each of the 50 files and BLAST the gene against them, then pick the sequence that "seems like the gene I am looking for". This part, picking a sequence that looks like the gene, can be a bit tricky if you have only scaffolds. Another way to get the 50 sequences would be to run Prodigal on each of the scaffold FASTA files to get sets of ORFs, and then BLAST the gene sequence against the predicted ORFs.
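To make the "pick a sequence" step more concrete, here is a rough Python sketch of one way to do it. It assumes you have already run BLAST for each strain with tabular output (for example `makeblastdb -in strain.fasta -dbtype nucl` followed by `blastn -query gene.fasta -db strain.fasta -outfmt 6`), so each hit line has the 12 default tab-separated columns. The function names (`read_fasta`, `best_hit`, `extract_hit`) are made up for illustration, and taking the highest bit score is only one possible selection rule, not the only sensible one.

```python
# Sketch: choose the best BLAST hit for one strain and cut it out of the scaffolds.
# Assumes blastn -outfmt 6 with the default 12 tab-separated columns:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

def read_fasta(text):
    """Parse FASTA text into a dict {record_id: sequence}."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # id is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T only; other letters pass through)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def best_hit(blast_tabular):
    """Return the hit with the highest bit score as a dict, or None if no hits."""
    best = None
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hit = {
            "sseqid": f[1],
            "pident": float(f[2]),
            "length": int(f[3]),
            "sstart": int(f[8]),
            "send": int(f[9]),
            "bitscore": float(f[11]),
        }
        if best is None or hit["bitscore"] > best["bitscore"]:
            best = hit
    return best


def extract_hit(scaffolds, hit):
    """Slice the hit region out of its scaffold, in the orientation of the query."""
    seq = scaffolds[hit["sseqid"]]
    lo, hi = sorted((hit["sstart"], hit["send"]))
    fragment = seq[lo - 1:hi]  # BLAST coordinates are 1-based and inclusive
    # sstart > send means the hit lies on the minus strand of the scaffold
    if hit["sstart"] > hit["send"]:
        return reverse_complement(fragment)
    return fragment


# Toy data standing in for one strain's scaffold file and its BLAST output.
demo_scaffolds = read_fasta(">scaf1 demo\nAAAATGCCC\nTAGAAA\n>scaf2\nGGGGCCCC\n")
demo_blast = (
    "gene\tscaf2\t80.0\t4\t1\t0\t1\t4\t1\t4\t0.5\t8.0\n"
    "gene\tscaf1\t100.0\t9\t0\t0\t1\t9\t4\t12\t1e-5\t18.0\n"
)
demo_hit = best_hit(demo_blast)
demo_gene = extract_hit(demo_scaffolds, demo_hit)
```

You would run this once per strain, write the 50 extracted sequences into a single FASTA file with the strain name as the header, and feed that file to the aligner. It is worth comparing each hit's `length` with the length of your query gene: a much shorter hit may mean the gene is split across scaffolds in that strain.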
There's a pipeline for this, called hal. However, in my opinion, you get more reliable trees when you only include conserved genes that are unlikely to transfer horizontally (e.g. ribosomal proteins). Also, the super alignments hal constructs from bacterial genomes are ridiculously long (100k-200k aa), so if you go this route you can only make maximum likelihood (ML) and maximum parsimony (MP) trees (no Bayesian).