How To Run Genome Wide Local Blast
2
1
Entering edit mode
8.5 years ago
HG ★ 1.2k

Hi Everyone, I have been given a task to compare one gene sequence among 50 strain of E.coli. For this study i have 50 genome scaffold file and one gene sequence file. Now my work will be compare the gene sequence among all the genome and compute a phylogenetic tree of the gene among all sequence. If anybody could point me in the right direction, I would be thankful!.

blast • 3.0k views
1
Entering edit mode
8.5 years ago
Pavel Senin ★ 1.9k

As far as I understand your question, for creating of that tree, you'll need to extract 50 sequences, one from each file representing a strain genome. Once you'll have these 50, you will need to align all them with clastalw and then make a tree. To get 50 sequences you can use blast - in this case i would build a db for each of 50 files and blast the gene against them, then pick a sequence which "seems like a gene i am looking for". This part - picking a sequence that looks like a gene - can be a bit tricky if you have only scaffolds. Another way to get 50 sequences would be to run prodigal on each of scaffolds fasta files getting sets of ORFs and then blast the gene sequence against predicted ORFs.

0
Entering edit mode

Thank you for your suggestion. If i would proceed like this : 1.Take all 50 genome sequence 2.Take the protein sequence of the specific gene 3. Run a local tblastn 4.extract all the sequences and run clustalw. 5. Generate the tree. Please comment on my idea.

0
Entering edit mode

sounds good. but, as i have mentioned, tblastn will give you a set of hits, which you'll need to somehow process in order to extract gene sequences - i.e eyeball and pick the best hit manually assigning start stop _if scaffolds are of good quality_... but what if there are gaps and mis-assemblies - what will be that best hit? using gene prediction may be better option: you run prodigal which will yield you protein sequences, then blastp those with your gene - pick a best hit putative gene sequence from each of 50 and proceed with the tree construction.

0
Entering edit mode
# Predicted gene
for file in ./*.fasta
do
echo $file ./prodigal.v2_60.linux -a out_trans -d out_nuc -i$file
mv out_trans $file_trans mv out_nuc$file_nuc
done


Hi i am trying to run all 50 genome like that but could you please check why it is not rename the out put file according to input file name.

0
Entering edit mode

shouldn't it be "$file"_trans and "$file"_nuc?

1
Entering edit mode
8.5 years ago
5heikki 10k

There's a pipeline for this. It's called hal. However, in my opinion, you get more reliable trees when you only include conserved genes that are unlikely to transfer horizontally (e.g. ribosomal proteins). Also, the super alignments hal constructs from bacterial genomes are ridiculously long (100k-200k aa), so if you go this route you can only make ML and MP trees (no Bayesian).

0
Entering edit mode

that is a cool stuff, thanks!

0
Entering edit mode

thank you for your suggestion. I appreciate your opinion but i have given a specific gene and i have to find if it is present or not among all the strain and i have to draw a tree among all the genome.

0
Entering edit mode

Yeah, sorry, I didn't get your question. So why don't you just blast the gene against all the genomes and extract the best hit in each case and go on from there? Shouldn't take very long at all..

0
Entering edit mode

I have extracted all the translated gene sequence from all 50 genome using Prodigal. Now i have to run a bath mode blast of all the sequence vs my query sequence and extract the max score gene and finally generate the tree. Could you please suggest me something , how to proceed next step. Although i had a look "hal" as you mention i want to try both the option.