Hi all.
I need some advice on how to proceed with a phylogenetics analysis. I have a FASTA file with amino acid sequences from human. (There are about 270 sequences. This number isn't important, I'm just mentioning it.)
>chr1:228672725-228672991
tcgatctcttgatcttgtgatccacctgcctcagcctcccaaagtgctgggattacaggcgtgtgccaccacatccag...
>chr1:941749-942423
TAAGATGGGATCCAGCAGTGCGAGACTGTGGCCCAGGTCAGATGGTGGCAGCTCGGCCTTCCTGGT...
..
.
I would like to blast these sequences against different species, then multi-align them and get MAF output to feed to PHAST (http://compgen.cshl.edu/phast/). My questions are:
1) Can I blast all my sequences against species of my choosing, and how?
2) Should I blast my sequences one by one (picking the ones I want) against species of my choosing, and how?
3) I am trying to find conservation scores between species and phylogenetic models (phastCons and phyloFit). Is my approach correct?
Any help is much needed. Thanks a lot.
P.S. If I didn't explain something correctly, it's because this is all very new to me. Thank you for your understanding.
1) Can I blast all my sequences to species of my desire,
By using the limit taxID option with blast+.
Thanks. But which database should I use?
If you use a large database like `nt` or `nr` then you can limit the searches to any number of taxonomy IDs using the method above.
I used this command that I found in an older post of yours, because I have amino acid sequences and I want to multi-align them later on so I can create a phylogeny tree:
Now I have the files from "nt.00.tar.gz + nt.00.tar.gz.md5" to "nt.22.tar.gz + nt.22.tar.gz.md5". What is the next step? Extract them (I'm guessing into one file, because makeblastdb requires a FASTA file as input?) and then try ./makeblastdb?
And then try
Sorry, but I am stuck. I can't use NCBI online because the CPU limit is reached.
The `.md5` files contain md5 hash values that are used for checking the integrity of the downloaded `nt` files. You can leave them as is. Uncompress all the other `nt.tar.gz` files by using `tar -zxvf nt.tar.gz` (will take some time). Keep all the files that result in one directory.
`outfmt 6` is generally used if you want to parse the file using programmatic means. See the description of the format here. You will need to include `staxids` in your command if you want to distinguish where the hits are.
Thanks again.
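
A minimal Python sketch of the two steps described above: checking each archive against its `.md5` file, then unpacking every volume into one directory. The function names and directory layout are assumptions for illustration, and it assumes each `.md5` file starts with the hex digest, as `md5sum`-style files do.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_matches(archive: Path, md5_file: Path) -> bool:
    """Compare the archive's MD5 digest with the first token of its .md5 file."""
    expected = md5_file.read_text().split()[0].lower()
    digest = hashlib.md5()
    with archive.open("rb") as handle:
        # Read in 1 MiB chunks so large volumes are never loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


def extract_volumes(download_dir: Path, db_dir: Path) -> list:
    """Verify and unpack every nt.*.tar.gz volume into a single directory."""
    db_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(download_dir.glob("nt.*.tar.gz")):
        md5_file = archive.with_name(archive.name + ".md5")
        if not md5_matches(archive, md5_file):
            raise ValueError(f"checksum mismatch: {archive.name}")
        # One archive per call, all unpacked into the same target directory.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(db_dir)
        extracted.append(archive.name)
    return extracted
```

Calling `extract_volumes(Path("downloads"), Path("blast_nt_db"))` (both paths hypothetical) would leave all index files side by side, which is the layout the reply asks for.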
I tried:
but it produced this error:
If you put all the `nt` files in the `/bin/blast_nt_db` directory then you need to supply the basename of the blast index to the `-db` option, as `-db /bin/blast_nt_db/nt`.
Multiple taxIDs should be separated by `,` not `;`.
Thanks, I am trying to run the command now. Although when typing ./blastn -help, the option for staxids :
I thought you wanted to restrict your search to specific `taxID`. That needs to be done on the command line as noted here: C: How to Blast with multiple species - Phylogenetic Analysis
The `staxid` option is for `-outfmt 6` format, to display the taxID in the result.
Yes, I do. But in this post, C: How to Blast with multiple species - Phylogenetic Analysis, you mentioned staxids, not staxid. Staxids need ";" for separation. For staxid, -help doesn't mention anything. So should I use staxid with ","? Thanks again.
The `staxids` option is for formatting blast results that are being written to the result file. Without that you would not know which taxID the result belongs to.
Your command should be something like:
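
As a hedged illustration of that distinction (a sketch, not the exact command from this reply), the Python snippet below assembles a `blastn` argument list in which `-taxids` restricts the search with comma-separated IDs while `staxids` only adds a taxID column to the tabular output. It assumes a BLAST+ release recent enough to support `-taxids`. The query file name, output file name, and the two example taxIDs (10090 for mouse, 9598 for chimpanzee) are hypothetical placeholders.

```python
import shlex


def build_blastn_command(query, db_basename, taxids, out_path):
    """Return a blastn argument list restricted to the given taxonomy IDs."""
    return [
        "blastn",
        "-query", query,
        # Basename of the index files, not the directory that holds them.
        "-db", db_basename,
        # Restricts the search itself; multiple IDs are joined with commas.
        "-taxids", ",".join(str(t) for t in taxids),
        # Only affects reporting: appends the hit's taxID(s) as a 13th column.
        "-outfmt", "6 std staxids",
        "-out", out_path,
    ]


if __name__ == "__main__":
    cmd = build_blastn_command(
        "my_sequences.fa",          # hypothetical query file
        "/bin/blast_nt_db/nt",      # path taken from the thread
        [10090, 9598],              # hypothetical choice of species
        "hits.tsv",                 # hypothetical output file
    )
    # Print a copy-pasteable command line; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the list first and printing it makes it easy to check the `-taxids` value before starting a long search.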
I ran the command three times, like so:
All are the same size (2.5 GB) and I think they are identical.
Now, how should I proceed with selecting, for each of my sequences, the top 1-2 hits from each taxonomy? Or should I proceed some other way?
Did you not read my last comment? This is not the way to run that blast command. Scroll to the right in the command I have in my last comment.
My bad, I didn't see it. Just ran it, and it works! Thanks.
Now, in order to do the multiple alignment and also create a phylogeny tree, should I pick the top 1-2 hits from each taxonomy for each of my sequences?
Or should I do it some other way?
Thanks a lot!!!
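
If picking the top hits per taxonomy turns out to be the route taken, the selection step itself could look roughly like the Python sketch below. It assumes the results were written with `-outfmt "6 std staxids"`, so that the bit score is the 12th column and the `;`-separated taxIDs the 13th. The function name and the `max_hits` parameter are made up for illustration. Whether the top hits are the right input for the alignment is a separate question that this sketch does not settle.

```python
import csv
from collections import defaultdict


def top_hits_per_taxid(lines, max_hits=2):
    """Keep the best-scoring subjects for each (query, taxID) pair.

    `lines` is any iterable of tab-separated rows in `6 std staxids` layout.
    """
    grouped = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 13:
            continue  # skip blank or malformed rows
        query, subject, bitscore = row[0], row[1], float(row[11])
        # A single subject can carry several taxIDs, separated by ';'.
        for taxid in row[12].split(";"):
            grouped[(query, taxid)].append((bitscore, subject))

    best = {}
    for key, hits in grouped.items():
        seen, kept = set(), []
        # Highest bit score first; count each subject once even if it has many HSPs.
        for score, subject in sorted(hits, key=lambda hit: -hit[0]):
            if subject in seen:
                continue
            seen.add(subject)
            kept.append((subject, score))
            if len(kept) == max_hits:
                break
        best[key] = kept
    return best
```

With a result file named, say, `hits.tsv` (hypothetical), `top_hits_per_taxid(open("hits.tsv"))` returns a dictionary keyed by `(query, taxID)` whose values list up to two subject accessions with their bit scores.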