Error running OMA standalone at HOG computation step
18 months ago
gaurav.diwan ▴ 20

I have been trying to run OMA standalone using a few genomes downloaded from the OMA browser plus a couple of my own custom genomes. All the steps of OMA standalone, including the All-vs-All computations, pairwise orthologs, etc., finished successfully. However, I consistently hit the same error in the final step of generating and storing HOGs for these genomes.

Here is the error message that I get despite trying to run OMA on different machines:

********************************************************************************

Storing results
Traceback (most recent call last):
File "/home/orthologs/OMA.2.4.1/bin/..//.venv/bin/warthogs.py", line 128, in <module>
run_gethogs()
File "/home/orthologs/OMA.2.4.1/bin/..//.venv/bin/warthogs.py", line 95, in run_gethogs
Settings.check_consistency_argument()
File "/home/orthologs/OMA.2.4.1/.venv/lib64/python3.6/site-packages/gethogs/settings.py", line 53, in check_consistency_argument
cls.inputfile_handler = file_manager.inputfile_handler_factory()
File "/home/orthologs/OMA.2.4.1/.venv/lib64/python3.6/site-packages/gethogs/file_manager.py", line 97, in inputfile_handler_factory
return OmaStandaloneFiles(os.path.normpath(join(settings.Settings.pairwise_folder, '..')))
File "/home/orthologs/OMA.2.4.1/.venv/lib64/python3.6/site-packages/gethogs/file_manager.py", line 44, in __init__
_csv.Error: field larger than field limit (131072)
********************************************************************************
An error occured in bottom-up HOG computations:

--- WARTHOGs:
- Start at 13:36 on 2020-12-08
- Orthology relations folder:Output/PairwiseOrthologs (standalone format)
- Method use to merge HOGS: pair
- Output file: Output/HierarchicalGroups.orthoxml

Error, (in GetHOGsBottomUp) failed to compute bottom-up hogs


Also, is there a way to run only the HOG computation stage of the program without having to generate all the output files again?

18 months ago

Hi Gaurav, (hi Alberto)

I had a look at the problem, and it turns out that there are some proteins with a huge number of cross-references in the FASTA header. The csv module has a limit on the length of each field, which is exceeded by these proteins. The fix is quite simple, so you can actually do it yourself if you want. I also plan to release a new version of OMA standalone within the next few days that contains the fix (it will be 2.4.2).

What you need to do is add the following line to hog_bottom_up/gethogs/file_manager.py, right after the import lines:

csv.field_size_limit(1<<20)


After this change, the analysis ran through in a few minutes without further problems.
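For readers who want to see the limit in isolation, here is a minimal sketch that reproduces the error outside OMA. It assumes a tab-separated row whose third field is one very long string (the 200,000-character header is made up for illustration); the helper name `read_rows` is hypothetical and is not part of gethogs.

```python
import csv
import io

# One tab-separated row whose third field is longer than the default
# limit of 131072 characters, mimicking a very long FASTA header.
long_header = "x" * 200_000
row = "species\t1\t" + long_header + "\n"


def read_rows(text):
    """Parse tab-separated text with the csv module and return all rows."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


csv.field_size_limit(131072)  # start from the default limit
try:
    read_rows(row)
    failed = False
except csv.Error as exc:
    failed = True
    print("default limit:", exc)

csv.field_size_limit(1 << 20)  # same call as in the fix
rows = read_rows(row)
print("raised limit: parsed", len(rows), "row, longest field:",
      max(len(field) for field in rows[0]), "characters")
```

Note that the limit applies to the length of a single field, not to the number of columns in a row.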

Thanks for pointing out the problem, and good luck with your analysis.


This finally worked! Although if I may point out, I also had to copy the file hog_bottom_up/gethogs/file_manager.py to the folder .venv/lib64/python3.6/site-packages/gethogs/ so that the program uses this modified file.

A couple of days ago, I had tried adding the csv.field_size_limit call to the file_manager.py in the lib64 folder, but that file was overwritten in a subsequent run of OMA. My guess is that on a new run of OMA the files in hog_bottom_up/gethogs/ are installed/copied into lib64. However, this did not happen today for some reason, and I got the same "too many fields" error. Only after copying the new file_manager.py to lib64 was I able to finish the HOG computation.
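The copy step can be sketched in Python as follows. This is only an illustration: the function name `install_patched_file` is hypothetical, the install root is taken from the traceback in the question, and it is an assumption that hog_bottom_up/ sits directly under that root. The check simply confirms that the installed copy contains the csv.field_size_limit line.

```python
import shutil
from pathlib import Path

PATCH_LINE = "csv.field_size_limit(1<<20)"


def install_patched_file(source: Path, target_dir: Path) -> Path:
    """Copy the patched file_manager.py into target_dir and confirm that
    the installed copy contains the csv.field_size_limit line."""
    if PATCH_LINE not in source.read_text():
        raise ValueError(f"{source} does not contain {PATCH_LINE!r}")
    target = Path(shutil.copy2(str(source), str(target_dir)))
    if PATCH_LINE not in target.read_text():
        raise RuntimeError(f"{target} was not updated")
    return target


if __name__ == "__main__":
    # Assumption: install root as shown in the traceback; adjust as needed.
    oma_root = Path("/home/orthologs/OMA.2.4.1")
    src = oma_root / "hog_bottom_up/gethogs/file_manager.py"
    dst = oma_root / ".venv/lib64/python3.6/site-packages/gethogs"
    if src.exists() and dst.is_dir():
        print("installed:", install_patched_file(src, dst))
    else:
        print("OMA install not found; adjust oma_root")
```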

Thanks a lot for all your help! Good luck with the further development of the program. I'm a fan and will be using it for a long time ;)


Thank you very much for your help. As Gaurav mentioned, it becomes necessary to copy the file hog_bottom_up/gethogs/file_manager.py to the folder .venv/lib64/python3.6/site-packages/gethogs/ , otherwise it doesn't use the modified file. I had to do this same thing as well, so I guess that's an additional confirmation.

As I said above, thanks a lot to you and everyone for the help; it was a good coincidence that this happened to two people at the same time. All the best!

18 months ago

Hi Gaurav,

Indeed, this looks quite strange. Regarding your question about rerunning only the HOG inference step: if you run OMA standalone with -d 2 (slightly more verbose output), you should find the command that starts the HOG computation step in the output log, just above the extract you showed. Otherwise, you can try the following command:

  /home/orthologs/OMA.2.4.1/.venv/bin/warthogs.py -m pair -o Output/HierarchicalGroups.orthoxml -i Output/PairwiseOrthologs -t standalone -p 65 -s Output/ManualSpeciesTree.nwk


If the problem persists, it would be helpful if you could share your output folder with us.
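If you prefer to launch that step from a script, the command above can be wrapped roughly like this. It is a sketch only: the function name `build_warthogs_command` is hypothetical, the arguments are copied verbatim from the command above, and it assumes the script is started from the directory that contains the Output folder.

```python
import subprocess
from pathlib import Path


def build_warthogs_command(oma_root, output_dir="Output"):
    """Return the warthogs.py call from this thread as an argument list."""
    return [
        f"{oma_root}/.venv/bin/warthogs.py",
        "-m", "pair",
        "-o", f"{output_dir}/HierarchicalGroups.orthoxml",
        "-i", f"{output_dir}/PairwiseOrthologs",
        "-t", "standalone",
        "-p", "65",
        "-s", f"{output_dir}/ManualSpeciesTree.nwk",
    ]


if __name__ == "__main__":
    # Assumption: install root as shown in the traceback; adjust as needed.
    cmd = build_warthogs_command("/home/orthologs/OMA.2.4.1")
    if Path(cmd[0]).exists():
        result = subprocess.run(cmd)
        print("warthogs.py exited with code", result.returncode)
    else:
        print("not found:", cmd[0])
```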


Thanks a lot for your answer. I tried running the HOG computation step using the second command you suggested and it still shows me the same csv_reader related error (too many fields). I can share the Output folder with you. Would you like the entire folder or just a few files from it? Also what would be a safe way to transfer this data (perhaps an email with a link)?

Best regards, Gaurav


Hi Gaurav,

Yes, an email with a link would be fine. The relevant files are ManualSpeciesTree.nwk, Map-SeqNum-ID.txt and the whole PairwiseOrthologs subfolder (all in the Output directory). Please send them to me as a tarball (maybe with WeTransfer) at adrian.altenhoff@inf.ethz.ch.
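One possible way to build such a tarball, sketched with Python's standard tarfile module. The function name `bundle_output` and the archive name are made up for illustration; only the three item names come from the message above.

```python
import tarfile
from pathlib import Path

# The three items requested above, all relative to the Output directory.
WANTED = ("ManualSpeciesTree.nwk", "Map-SeqNum-ID.txt", "PairwiseOrthologs")


def bundle_output(output_dir, archive_path):
    """Write the requested items into a gzipped tarball.

    Returns the names of any requested items that were not found."""
    output_dir = Path(output_dir)
    missing = []
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for name in WANTED:
            item = output_dir / name
            if item.exists():
                # Directories are added recursively by default.
                tar.add(str(item), arcname=f"Output/{name}")
            else:
                missing.append(name)
    return missing


if __name__ == "__main__":
    if Path("Output").is_dir():
        not_found = bundle_output("Output", "oma_output_for_debugging.tar.gz")
        print("missing items:", not_found or "none")
    else:
        print("no Output directory here")
```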


Hi everybody,

I hope it is fine to leave this reply here. I am experiencing exactly the same issue as Gaurav when trying to run OMA standalone with genomes downloaded from the OMA browser plus a custom genome. I get the same error, and it stops at the same step. I have followed all the advice given to Gaurav, to no avail in my case either. I would be happy to provide any extra information, or at least to report here for the record that someone else has the same issue.

Thank you very much, Alberto

18 months ago
alex.wv ▴ 50

Hi Gaurav,

I work in the group that develops OMA. I see that the error you have encountered occurs when we're reading IDs from the Output/Map-SeqNum-ID.txt file.

It appears that either this file is corrupt or a sequence header in your input FASTA files is malformed. The error message means that there are more columns than the default limit for the CSV library in Python (131072). This file should only contain 3 columns though (species, offset, FASTA header), so something is going wrong.

Please could you check if this file (Output/Map-SeqNum-ID.txt) looks correct? I suspect that it is corrupt, as even if there are tabs in the FASTA header I wouldn't expect there to be 131,000!

Every time you rerun it should delete the "Output" directory, including this file, so it would have to be corrupt multiple times. We often encounter strange behaviour like this when disk quota has been exceeded when running on a shared computing facility. Is this a possibility here?

If you want to rerun and are satisfied that some of your output files are valid, you can set a new output folder in parameters.drw (i.e., change "OutputFolder := 'Output';") and also set some of the outputs to "false" in this file if you already have valid copies of them (e.g., search for WriteOutput_OrthologousPairs_orthoxml, the most time-consuming one).

Best wishes, Alex


Hi Alex,

Thanks a lot for your answer! Well, the FASTA headers of my custom genome files come from UniProt and contain a bunch of details. However, I used the following command to check how many columns each line of the Output/Map-SeqNum-ID.txt file has:

awk --field-separator="\t" '{print NF}' Output/Map-SeqNum-ID.txt | uniq -c
1 1 #the first comment line
682780 3 #all remaining lines


So I see that all the lines of this file have exactly 3 tab-separated columns. So, I don't understand why the CSV library thinks I have more than that in any row. Do you think I should reformat the FASTA headers for my genome files and recalculate everything?
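The awk command above counts columns per line. Since Python's csv.field_size_limit is a limit on the size of a single field, a complementary check is to look at field lengths instead. A minimal sketch, assuming the file is tab-separated as in the awk command; the helper names `field_lengths` and `longest_fields` are hypothetical:

```python
import heapq
from pathlib import Path

CSV_DEFAULT_LIMIT = 131072  # the number shown in the error message


def field_lengths(path, delimiter="\t"):
    """Yield (length, line_number, column) for every field in the file."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.rstrip("\n").split(delimiter)
            for column, field in enumerate(fields, start=1):
                yield len(field), line_number, column


def longest_fields(path, top=5):
    """Return the `top` longest fields, longest first."""
    return heapq.nlargest(top, field_lengths(path))


if __name__ == "__main__":
    target = Path("Output/Map-SeqNum-ID.txt")
    if target.exists():
        for length, line_number, column in longest_fields(target):
            flag = "OVER LIMIT" if length > CSV_DEFAULT_LIMIT else "ok"
            print(f"line {line_number}, column {column}: {length} chars ({flag})")
    else:
        print("file not found:", target)
```

Any field reported as over the limit would point to the specific line (and hence the FASTA header) worth inspecting.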

As far as I can tell the Output folder is being deleted every time I run the third part of OMA. I gather that OMA itself has a "rm -rf" command for the Output folder that runs when I start the program. Also I don't think there are any disk quota issues as I'm using our group server where there is plenty of space and no restrictions.

Thanks a lot for your suggestions regarding the output, I will keep these in mind.

Best regards, Gaurav