Question: Some Questions About Using OrthoMCL To Find Orthologs Within Many Species
User 7478 wrote (7.8 years ago):
  1. Following the OrthoMCL User Guide, I used orthomclAdjustFasta to produce a compliant FASTA file, in which each protein has a definition line in the following format: >xxx|yyyyyyyy. But when I run blastall:

    blastall -i ALL_goodProteins.fasta -d BLL_goodProteins.fasta -p blastp -e 1e-10 -m 8 -o A-to-B.txt

     I get errors like these:

    [blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta stop(449) >= len(367)
    [blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(450) >= len(367)
    [blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(459) >= len(367)
    [blastall] ERROR: SeqPortNew: lcl|172_BLL_goodProteins.fasta start(531) >= len(367)

     I think all the sequences named "BLL|yyyyy" or "ALL|yyyyyyy" may be being treated as duplicate IDs.

  2. So I then ran NCBI BLAST (-m 8) on a non-compliant FASTA file (each protein has only a definition line >yyyy). But when I passed those BLAST results to orthomclBlastParser, I got only an empty file named similarSequences.txt.

Can anyone help me? Thank you very much!

Tags: fasta, orthomcl, conversion
Damian Kao wrote (7.8 years ago):

Does your header line have spaces? BLAST reads everything up to the first space as the sequence ID. If two entries have the same name up to the first space, that can cause the error you described. For example:

>SMA|ID 02919
XXXXXXXXXXXXXXX
>SMA|ID 02399
XXXXXXXXXXXXX

For BLAST, both of those sequences would have the same ID, "SMA|ID", causing an error.
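
As a rough way to check a FASTA file for this problem before running BLAST, the sketch below counts IDs the way the answer describes (header text up to the first whitespace) and reports any that repeat. The function names `blast_ids` and `duplicate_ids` are made up for this illustration; the input is the example from the answer.

```python
from collections import Counter


def blast_ids(fasta_lines):
    """Yield the header text up to the first whitespace for each '>' line."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A header with nothing after '>' yields an empty ID.
            yield fields[0] if fields else ""


def duplicate_ids(fasta_lines):
    """Return a dict mapping each repeated ID to the number of times it occurs."""
    counts = Counter(blast_ids(fasta_lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# The two-record example from the answer above.
example = [
    ">SMA|ID 02919",
    "XXXXXXXXXXXXXXX",
    ">SMA|ID 02399",
    "XXXXXXXXXXXXX",
]

print(duplicate_ids(example))  # {'SMA|ID': 2}
```

To check a real file, the same function can be given an open file handle, for example `duplicate_ids(open("BLL_goodProteins.fasta"))`. An empty result means no two headers share an ID up to the first space.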


OrthoMCL really just needs those first three characters to distinguish between the two datasets when you do the all-vs-all BLAST. Whatever comes after the 'XXX|' is just the ID of the sequence within the dataset, which needs to be unique and free of spaces for BLAST to work. So if you reformat your FASTA files so that there are no spaces in the ID field, it should work.

Written 7.8 years ago by Damian Kao
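
One possible way to do the reformatting suggested above is sketched here. It keeps the taxon prefix and replaces each run of whitespace in the rest of the header with an underscore. The underscore, and the function names `compliant_header` and `fix_fasta`, are assumptions made for this example, not something OrthoMCL prescribes. Any description text after the ID is folded into the ID as well, so the result should still be checked for uniqueness.

```python
import re


def compliant_header(header, taxon):
    """Return '>taxon|id' with whitespace in the ID replaced by underscores."""
    text = header.lstrip(">").strip()
    prefix = taxon + "|"
    # Drop an existing 'taxon|' prefix so it is not duplicated.
    if text.startswith(prefix):
        text = text[len(prefix):]
    seq_id = re.sub(r"\s+", "_", text)
    return ">" + prefix + seq_id


def fix_fasta(lines, taxon):
    """Yield FASTA lines with header lines rewritten and sequence lines unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        yield compliant_header(line, taxon) if line.startswith(">") else line


records = [">SMA|ID 02919", "XXXXXXXXXXXXXXX", ">SMA|ID 02399", "XXXXXXXXXXXXX"]
for out_line in fix_fasta(records, "SMA"):
    print(out_line)
```

With the example records, the two headers become `>SMA|ID_02919` and `>SMA|ID_02399`, which are distinct up to the first space.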

Yes, I think that might be the reason. But the OrthoMCL User Guide says "each protein in those files must have a definition line in the following format: >xxxx|yyyyyyyy", otherwise I cannot run later steps such as orthomclBlastParser.

Written 7.8 years ago by User 7478

Thank you very much! Each of my original sequence IDs contained a space, and once I removed it, blastall ran successfully!

But I have another problem. When I run orthomclBlastParser like this:

    orthomclBlastParser Hsa-Ath.txt Ath >> similarSequences.txt

where "Hsa-Ath.txt" is the BLAST output in -m 8 format and "Ath" is the directory of compliant FASTA files produced by orthomclAdjustFasta, it tells me:

    couldn't find taxon for gene '2_Ath.fasta' at /opt/bin/orthomclBlastParser line 103, <F> line 1.

Could you help me? Thank you!

Written 7.8 years ago by User 7478
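
The error message names an ID ('2_Ath.fasta') that does not look like a taxon|id header, so one thing worth checking is whether every ID in the BLAST output also appears as a header in the compliant FASTA files. The sketch below does that comparison. It assumes the parser needs such a match, which is only inferred from the error text. The function names and the sample rows are invented for the illustration.

```python
def fasta_ids(fasta_lines):
    """Collect the IDs (header text up to the first whitespace) from FASTA lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def unknown_blast_ids(m8_lines, known_ids):
    """Return sorted query/subject IDs from tabular BLAST lines that are not in known_ids."""
    unknown = set()
    for line in m8_lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        # In -m 8 output the first two columns are the query and subject IDs.
        for seq_id in columns[:2]:
            if seq_id not in known_ids:
                unknown.add(seq_id)
    return sorted(unknown)


# Invented sample data: two compliant headers and one BLAST row whose
# subject ID is not one of them.
fasta = [">Hsa|g1", "MKV", ">Ath|g1", "MED"]
m8 = ["Hsa|g1\t2_Ath.fasta\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t80.0"]

print(unknown_blast_ids(m8, fasta_ids(fasta)))  # ['2_Ath.fasta']
```

If this kind of check reports IDs such as '2_Ath.fasta', the BLAST run did not carry the FASTA header IDs through to its output, and the parser has nothing to match them against.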

I have a similar problem (see "Blast Gives Cryptic Errors"), but I don't see any spaces.

Written 6.8 years ago by hbw70

If you wanted to use orthomclAdjustFasta on this, you would want to pass 3 as the location of the ID, because that script treats spaces and vertical-bar ('|') characters in the header as field separators... unless you want to keep whatever word is in the place of "ID", in which case you would want to remove the space between "ID" and "02919".

Written 6.1 years ago by sburlce0
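
To see why the ID lands in field 3 for the example header, the sketch below splits a definition line on whitespace and '|' and numbers the fields from 1. It only illustrates the splitting rule described in the comment above and is not the code of orthomclAdjustFasta itself. `header_fields` is a name made up for this example.

```python
import re


def header_fields(header):
    """Split a FASTA definition line into fields on runs of whitespace or '|'."""
    text = header.lstrip(">").strip()
    return [field for field in re.split(r"[\s|]+", text) if field]


fields = header_fields(">SMA|ID 02919")
print(fields)         # ['SMA', 'ID', '02919']
print(fields[3 - 1])  # field 3, counting from 1: 02919
```

With the space removed, `>SMA|ID02919` splits into two fields, and the ID would be field 2 instead.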