Question: How To Run Genome Wide Local Blast
1
gravatar for HG
7.1 years ago by
HG1.1k
Germany
HG1.1k wrote:

Hi Everyone, I have been given a task to compare one gene sequence among 50 strain of E.coli. For this study i have 50 genome scaffold file and one gene sequence file. Now my work will be compare the gene sequence among all the genome and compute a phylogenetic tree of the gene among all sequence. If anybody could point me in the right direction, I would be thankful!.

blast • 2.5k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 7.1 years ago by HG1.1k
1
gravatar for Pavel Senin
7.1 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

As far as I understand your question, for creating of that tree, you'll need to extract 50 sequences, one from each file representing a strain genome. Once you'll have these 50, you will need to align all them with clastalw and then make a tree. To get 50 sequences you can use blast - in this case i would build a db for each of 50 files and blast the gene against them, then pick a sequence which "seems like a gene i am looking for". This part - picking a sequence that looks like a gene - can be a bit tricky if you have only scaffolds. Another way to get 50 sequences would be to run prodigal on each of scaffolds fasta files getting sets of ORFs and then blast the gene sequence against predicted ORFs.

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Pavel Senin1.9k

Thank you for your suggestion. If i would proceed like this : 1.Take all 50 genome sequence 2.Take the protein sequence of the specific gene 3. Run a local tblastn 4.extract all the sequences and run clustalw. 5. Generate the tree. Please comment on my idea.

ADD REPLYlink written 7.1 years ago by HG1.1k

sounds good. but, as i have mentioned, tblastn will give you a set of hits, which you'll need to somehow process in order to extract gene sequences - i.e eyeball and pick the best hit manually assigning start stop _if scaffolds are of good quality_... but what if there are gaps and mis-assemblies - what will be that best hit? using gene prediction may be better option: you run prodigal which will yield you protein sequences, then blastp those with your gene - pick a best hit putative gene sequence from each of 50 and proceed with the tree construction.

ADD REPLYlink written 7.1 years ago by Pavel Senin1.9k
# Predicted gene
for file in ./*.fasta
    do
    echo $file 
    ./prodigal.v2_60.linux -a out_trans -d out_nuc -i $file
     mv out_trans $file_trans
     mv out_nuc $file_nuc
    done

Hi i am trying to run all 50 genome like that but could you please check why it is not rename the out put file according to input file name.

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by HG1.1k

shouldn't it be "$file"_trans and "$file"_nuc?

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Pavel Senin1.9k
1
gravatar for 5heikki
7.1 years ago by
5heikki9.4k
Finland
5heikki9.4k wrote:

There's a pipeline for this. It's called hal. However, in my opinion, you get more reliable trees when you only include conserved genes that are unlikely to transfer horizontally (e.g. ribosomal proteins). Also, the super alignments hal constructs from bacterial genomes are ridiculously long (100k-200k aa), so if you go this route you can only make ML and MP trees (no Bayesian).

ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by 5heikki9.4k

that is a cool stuff, thanks!

ADD REPLYlink written 7.1 years ago by Pavel Senin1.9k

thank you for your suggestion. I appreciate your opinion but i have given a specific gene and i have to find if it is present or not among all the strain and i have to draw a tree among all the genome.

ADD REPLYlink written 7.1 years ago by HG1.1k

Yeah, sorry, I didn't get your question. So why don't you just blast the gene against all the genomes and extract the best hit in each case and go on from there? Shouldn't take very long at all..

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by 5heikki9.4k

I have extracted all the translated gene sequence from all 50 genome using Prodigal. Now i have to run a bath mode blast of all the sequence vs my query sequence and extract the max score gene and finally generate the tree. Could you please suggest me something , how to proceed next step. Although i had a look "hal" as you mention i want to try both the option.

ADD REPLYlink written 7.1 years ago by HG1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1362 users visited in the last hour
_