Question: OrthoMCL issue with BLAST and mySQL
gravatar for merwright
7 months ago by
merwright0 wrote:

I'm having a problem with orthoMCL at the "orthomclPairs" stage; it is being caused from an earlier action between the BLAST and relational database step. Here is a recap of the steps for anyone familiar with the program but needed a reminder:

  • BLAST database built with all "goodProteins"
  • BLASTall using "goodProteins" as a query against the database. Output is in specified format.
  • orthomcl schema installed into a mySQL database
  • orthomclBlastParser takes blast output and makes new file for mySQL called similarSequences.
  • orthomclLoadBlast takes new file and creates table in mySQL with it called SimilarSequences.
  • orthomclPairs takes the first table and creates new tables.

The point all my runs fail is at orthomclPairs, where I get a duplicate key message:

~/orthomcl/test1$ DBD::mysql::st execute failed: Duplicate entry 'l3a|maker-scaffold_5922-exonerate_est2genome-gene-0.13-mRNA--the' for key 'best_hit_ix' at /share/apps/orthomcl-2.0.9/bin/orthomclPairs line 709, <F> line 15.

I've been troubleshooting that for a few weeks but it seems like what others have posted doesn't work for me. The 'best_hit_ix' is a table being made. Here is a short list of what I thought was a problem but did not end of fixing it.

1) Looking for a fix for this error directly led me to two discussions that seemed to just drop rows in the SimilarSequences table that were not unique:

This didn't seem difficult so I renamed the original table to keep as a backup and made a new table from this with their fix, in mySQL:

> rename table SimilarSequences to similarSequences;

> create table SimilarSequences as select distinct * from similarSequences;

Unfortunately I ran into the same problem again, so it seemed my issue wasn't identical.

2) I thought the SimilarSequences table had multiple query-subject combinations (which I thought was ok because there are multiple places in each protein that are found to align) and that it had to be removed. So I thought I would use mySQL to only keep distinct combinations of the subject and query:

create table test1 as select distinct query_id,subject_id from similarSequences;

select * from test1 group by query_id,subject_id having count(*)>1;

This also didn't make the program happy. I got an error message when I ran orthomclPairs.

3) Maybe the genome I am working on had deflines that made orthomcl unhappy. I subbed all dashes for underscores and reran the entire pipeline. Did not fix it.

4) I thought I should try modifying the BLAST procedure since I am more comfortable with controlling that. I modified all the arguments so it was similar to the blastall script, but in blastp instead. Using blastp, I can set a maximum for hsp which would be the equivalent (I thought) of limiting a single combination of query to subject. Here is the BLAST script the authors give:

-F 'm S' -v 100000 -b 100000 -z protein_database_size -e 1e-5

Here is the first blastall script I worked with:

blastall -p blastp -d goodprotdb -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastall.out

And here is the blastp script I made which should give me the same format the orthomcl pipeline needs:

blastp -db goodprotdb -max_hsps 1 -num_descriptions 100000 -num_alignments 100000 -dbsize 36146 -evalue 0.01 -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -query goodProteins.fasta -o blastp.out

I also tried modifying the original blastall recipe with two variations, K and P:

blastall -p blastp -d goodprotdb -K 1 -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastpK.out

blastall -p blastp -d goodprotdb -P 1 -F 'm S' -v 100000 -b 100000 -z 36146 -e 0.01 -m 8 -i goodProteins.fasta -o blastpP.out

These options stand for

-P 0 for multiple hit, 1 for single hit

-K Number of best hits from a region to keep.

I couldn't determine completely from online resources if these options were what I thought they were, but if duplicate entries were causing the failure, I thought I would at least do the blast and compare them to the original. The plain 'blastall' and 'blastall' with -P are done, and I don't think that made a difference because they are the same size. The 'blastall' with -K and 'blastp' are still running, and the file size for that is about half, so I know it is trimmed down.

What I am unsure of about this approach is if I am going to run into problems with the results. I could not determine for sure if the multiple hits from each protein-protein pair were used in the calculation of percent match in the orthomclBlastParser program, or if only having a single result for each match was sufficient to run the program. I looked at the code for this script and couldn't decide, so I thought I would at least try the variations and see if one allowed the program to reach the end successfully.

Feedback/ideas/suggestions all welcome, I've been working on this for a few weeks and our university help desk is also not sure how to fix it.

ADD COMMENTlink modified 12 weeks ago by hudak70 • written 7 months ago by merwright0

Save yourself a lot of hassle and just switch to using Roary instead.

ADD REPLYlink written 7 months ago by jrj.healey6.1k

Hi jrj,

Thanks for the idea, I have not heard of that program. Briefly looking at the program, I don't think it would be an option for this project. I am working on microbial eukaryotes, and this relies on using prokka for starting the pipeline. Annotation for Eukaryotes is more challenging, and I am working on organisms that are divergent from what is well-known in databases. prokka is best for bacterial and archaeal genomes, and would miss a lot of gene calls for me, especially alternate transcripts.

However I am doing a pangenome of archaea and this might be a great tool in addition to anvio, so I think i'll look into it a bit more. Thank you!

ADD REPLYlink modified 7 months ago • written 7 months ago by merwright0

Ah yeah, didn’t realise you were using eukaryotic data, no worries.

ADD REPLYlink written 7 months ago by jrj.healey6.1k
gravatar for danilo.trabuco
7 months ago by
danilo.trabuco30 wrote:

Hi merwright,

I had the same problem, a duplicate entry after ran orthomclPairs! I checked my nohup.out file and I realized that the sequence name was interrupted (incomplete). I read the mysql tutorial and I found that the name length is 64 characters. If you check your entry, it ialso has 64 character and is incomplete, right ('l3a|maker-scaffold_5922-exonerate_est2genome-gene-0.13-mRNA--the')? Maybe you have more sequences with the same name, because of the short length name! In this case, we will have to reduce the name of all sequences and do everything again! I'm trying to find a best way to do that. If you have some idea please contact me!

Best, Dan

ADD COMMENTlink written 7 months ago by danilo.trabuco30

Personally I've stopped using OrthoMCL and started using Orthofinder, too many problems around MySQL especially on clusters. Plus Orthofinder gives you nice preliminary trees and more stats.

You may also want to take a look at PorthoMCL: It's a reimplementation of OrthoMCL without the need for MySQL, should give you the same results (but I haven't tested that)

ADD REPLYlink written 7 months ago by Philipp Bayer5.6k

Hi Philipp,

PorthoMCL was actually the first on my list of alternatives I found in a 2017 review, and OrthoFinder was another alt. Would you recommend one of those two first? I will be looking at very different organisms--species from each major eukaryotic supergroup. The review by Nichio et al. said they both had a number of dependencies as the main disadvantage.

Thanks for the advice!

ADD REPLYlink written 7 months ago by merwright0

Personally I prefer Orthofinder, in my work with plant resistance genes I got some very weird clusters with OrthoMCL/PorthoMCL while the Orthofinder clusters made sense based on R-gene classes.

I still haven't understood how Orthofinder makes the species tree, though. But it also makes a file with the single copy orthologous clusters so you can concatenate those for a 'standard' MSA and phylogeny analysis.

You can see benchmarking results for a whole bunch of orthology programs:

ADD REPLYlink written 7 months ago by Philipp Bayer5.6k


You were absolutely right, the schema installed on OMCL is insufficient to handle full names. Thank you for responding!


ADD REPLYlink written 7 months ago by merwright0

Hi Dan, I wanted to follow up with how to change MySQL to allow original names to be used.

After going into MySQL, use this command to alter the column width:

alter table SimilarSequences change SUBJECT_TAXON_ID SUBJECT_TAXON_ID varchar(250)

You will have to do this with all columns using characters in all tables, so look at table properties.

The other option is to simplify the contig names, but I didn't want to do this because it made it difficult to follow my contigs.

Best, Matt

ADD REPLYlink written 7 months ago by merwright0
gravatar for hudak7
12 weeks ago by
hudak70 wrote:

Did altering your blast script end up working? I am in the same boat.

ADD COMMENTlink written 12 weeks ago by hudak70

@hudak7 The problem was the SQL had a limited column width when OrthoMCL set up the database. It wasn't a terribly hard fix once that was figured out, you had to extend the characters allowed in columns because MAKER creates long names with annotation:

alter table SimilarSequences change SUBJECT_TAXON_ID SUBJECT_TAXON_ID varchar(250)

Hope that solves the problem because the BLAST was not the issue.

ADD REPLYlink written 12 weeks ago by merwright0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1369 users visited in the last hour