How to align a protein set to a genome?
6.9 years ago
Dan ▴ 520

Hi,

I have a new genome assembly and I want to align the protein sequences of the original assembly against it. What is the best tool for this job?

protein genome alignment prediction annotation • 4.0k views
6.9 years ago
Juke34 7.2k

Hi, several suggestions:

  1. If you want approximate alignments, you can use Pmatch or tblastn.
  2. If you want something precise, you can use exonerate or GeneWise, which give splice-aware alignments.

This publication reviews the performance of 7 tools that do spliced alignment of proteins (it also looks at 12 tools that do DNA alignment):
Hiroaki Iwata and Osamu Gotoh, Nucleic Acids Res. 2012 Nov; 40(20): e161. doi: 10.1093/nar/gks708

The second approach is more time consuming if you run these tools directly. Often the two steps are coupled: the first step is used to define chunks of the genome that are then sent to the second-step tools (e.g. within the Maker and Ensembl annotation pipelines).
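To illustrate the coupling, here is a minimal Python sketch of how fast first-pass hits could be turned into genome chunks for a splice-aware aligner. It assumes the hits are in the standard 12-column BLAST tabular format (`-outfmt 6`), where minus-strand tblastn hits have `sstart` greater than `send`. The function name `candidate_regions` and the `flank` padding are made up for this example, and this is not how Maker or Ensembl implement it. It also naively merges all hits of one protein on the same contig and strand, so distant paralogs would end up in one large region.

```python
def candidate_regions(blast_tab_lines, flank=1000):
    """Collapse tabular tblastn hits into one padded span per (protein, contig, strand).

    Expects the 12 default -outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    spans = {}
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        protein, contig = fields[0], fields[1]
        sstart, send = int(fields[8]), int(fields[9])
        # Minus-strand hits are reported with sstart > send.
        strand = "+" if sstart <= send else "-"
        lo, hi = min(sstart, send), max(sstart, send)
        key = (protein, contig, strand)
        if key in spans:
            old_lo, old_hi = spans[key]
            spans[key] = (min(old_lo, lo), max(old_hi, hi))
        else:
            spans[key] = (lo, hi)
    # Pad each span so that flanking exons missed by the fast search are included.
    # The upper bound is not clipped to the contig length (lengths are unknown here).
    return {key: (max(1, lo - flank), hi + flank) for key, (lo, hi) in spans.items()}


if __name__ == "__main__":
    example_hits = [
        "p1\tchr1\t90.0\t50\t5\t0\t1\t50\t5000\t5149\t1e-20\t100",
        "p1\tchr1\t85.0\t40\t6\t0\t51\t90\t7000\t7119\t1e-15\t80",
    ]
    for key, span in candidate_regions(example_hits).items():
        print(key, span)
```

Each resulting region could then be extracted from the genome and passed, together with its protein, to the splice-aware tool.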

Cheers

I think you meant tblastn and not Blastx.

Right, I will update the post!

May I also ask here whether Promer (MUMmer) has an option for aligning a proteome to a genome? From what I can see it only accepts nucleotide sequences. Am I missing something?

In the MUMmer4 publication they state "It is not restricted to DNA and can also align protein sequences". It is not clearly stated in the manual, but it looks like you can use a proteome as input to Promer. This approach does not provide splice-aware alignments.

Another really fast tool would be PSimScan.

Otherwise, if you are looking for splice-aware alignment, you could have a look at this publication, which shows the performance of 7 different tools for protein alignment:
Hiroaki Iwata and Osamu Gotoh, Nucleic Acids Res. 2012 Nov; 40(20): e161. doi: 10.1093/nar/gks708

I tried to align a proteome to a genome using promer, but it treats the protein letters as IUPAC nucleotide codes and turns every letter it does not recognize into N... Anyway, I'll look more into their publication, thanks. PSimScan looks like a good tool too! I just need an "approximate" alignment at this stage, but I will look into the publication you mentioned for future reference. Many thanks!

Here is what they say in the MUMmer4.x README:

promer is for the protein level, all-vs-all comparison of nucleotide sequences contained in multi-FastA data files. The nucleotide input files are translated in all 6 reading frames and then aligned to one another via the same methods as nucmer.

I think it can only deal with nucleotides.
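For anyone unsure what "translated in all 6 reading frames" means in practice, here is a small pure-Python sketch. It is only an illustration of the idea, not promer's actual code: the function names are my own, it assumes the standard genetic code, and it writes X for any codon containing a non-ACGT character.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Since both inputs go through this kind of translation before alignment, a file that already contains amino acids is not a valid input, which matches what was observed above.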

That's a pity they don't check whether the input is AA or DNA and skip the six-frame translation if it is already protein. You should create an issue and ask if it could be implemented in a future version.

You're right. I will.

6.9 years ago
SES 8.5k

I would use BLAT or exonerate. BLAT is better for more closely related species, and the nice thing is that both can produce a BLAST-style table for easy parsing (though with exonerate you have to use the 'roll-your-own' output with a custom format string, which I could share). Exonerate is used by Maker for protein alignments, and it has many more options for controlling splicing and intron modeling, codon alignment, etc. BLAT is a lot faster, so that is a trade-off to consider.

6.9 years ago
vassialk ▴ 200

NextGene software is good. Or learn Biopython with Biojava/Bioruby.

That is not a good reason to lock yourself into proprietary software for something you can do easily with exonerate or Blastx.

6.8 years ago
jomaco ▴ 200

If you wish to align those proteins to a reference assembly, you could use the exonerate (http://www.ebi.ac.uk/~guy/exonerate/) protein2genome model, which models introns. I used this when I wanted to align proteins from the TAIR10 database to our reference genome. You would probably also want to split the file into considerably smaller chunks so that many faster individual alignments can be run before the results are merged; this way the alignment as a whole will be much quicker.
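As a rough sketch of the splitting step, something like the following Python would do. The function names, the output naming scheme and the default of 100 sequences per chunk are arbitrary choices for illustration, not what I actually ran.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def chunk_records(records, chunk_size):
    """Group records into lists of at most chunk_size items."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly smaller, chunk


def write_chunks(fasta_path, out_prefix, chunk_size=100):
    """Write numbered chunk files (<out_prefix>.0001.fa, ...) and return their paths."""
    paths = []
    with open(fasta_path) as handle:
        chunks = chunk_records(read_fasta(handle), chunk_size)
        for index, chunk in enumerate(chunks, start=1):
            path = f"{out_prefix}.{index:04d}.fa"
            with open(path, "w") as out:
                for header, seq in chunk:
                    out.write(f">{header}\n{seq}\n")
            paths.append(path)
    return paths
```

Each chunk file can then be submitted as its own exonerate job and the outputs concatenated afterwards. As far as I know, exonerate also has built-in `--querychunkid` and `--querychunktotal` options for this kind of job splitting, so check `exonerate --help` before rolling your own.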
