Question: Tools to find the unique proteins (without orthologs) in a bacterial taxon
natasha.sernova (3.5k) wrote, 4.1 years ago:

Dear ALL,

I've used the OMA program to find a set of orthologous proteins in a bacterial taxon.

Unfortunately, this very nice program does not output a set of unique proteins.

I know that OrthoMCL and the OrthoDB tools can do this, but I was not very successful in finding these proteins with them. Are there any other tools to find the unique genes in a bacterial taxon?

I need all proteins that do not have any orthologs in this taxon, i.e. only unique proteins, or singletons. Could you give me some hints?

Thank you very much!

Natasha

modified 4.1 years ago by h.mon (26k) • written 4.1 years ago by natasha.sernova (3.5k)

How did you run OrthoMCL? What do you mean by "I was not very successful"? Did you get any results, or did you not even manage to get it running? What kind of data do you have: predicted peptides from genomes, proteins downloaded from nr, etc.?

In my experience, OrthoMCL outputs all the clusters it finds, including clusters with only one gene and one taxon.

written 4.1 years ago by h.mon (26k)

Sorry, I was not quite correct. I did get a list of singletons from OrthoMCL. I would like to make sure this list of proteins is correct and complete, and that other programs give the same set. I took all proteins for this bacterial taxon from NCBI, i.e. the .faa datasets for each bacterium in this taxon. I discarded only proteins shorter than 50 aa that had no annotation, and I kept all proteins longer than 100 aa regardless of their annotation.
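One way to cross-check a singleton list independently of any one tool might look like the sketch below. It assumes an OrthoMCL-style groups file in which each line reads `group_id: taxon|protein taxon|protein ...`, plus a list of all protein IDs that went into the run. The function names (`parse_groups`, `find_singletons`, `compare`) and the toy identifiers are hypothetical and only serve the illustration.

```python
def parse_groups(lines):
    """Parse OrthoMCL-style lines: 'group_id: taxon|prot1 taxon|prot2 ...'."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = members.split()
    return groups


def find_singletons(all_ids, groups):
    """Return the IDs that do not share a group with any other protein."""
    clustered = set()
    for members in groups.values():
        if len(members) >= 2:
            clustered.update(members)
    return sorted(set(all_ids) - clustered)


def compare(list_a, list_b):
    """Return (only in A, only in B) for two singleton lists."""
    a, b = set(list_a), set(list_b)
    return sorted(a - b), sorted(b - a)


# Toy data with hypothetical identifiers (not taken from the thread).
group_lines = [
    "grp1: tax1|A tax2|B tax3|C",
    "grp2: tax1|D tax2|E",
    "grp3: tax3|F",
]
all_ids = ["tax1|A", "tax2|B", "tax3|C", "tax1|D", "tax2|E", "tax3|F", "tax2|G"]

singletons = find_singletons(all_ids, parse_groups(group_lines))
print(singletons)

# Compare against a list produced by another tool (also made up here).
only_here, only_there = compare(singletons, ["tax3|F", "tax1|D"])
print(only_here, only_there)
```

Proteins that appear in only one of the two lists are the ones worth inspecting by hand.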

written 4.1 years ago by natasha.sernova (3.5k)

If your datasets are the complete translated gene sets, then the OrthoMCL singletons are a good place to start. You could BLAST the singletons against the other genomes to see whether they are actually there and were just not found.

If your data include translated transcriptomes, I am less sure, because transcriptomes often lack genes that are not expressed in a particular tissue or developmental stage, or that were not sampled because of low expression.
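As a rough sketch of that BLAST check: assuming the singletons were searched against the other proteomes with BLAST+ tabular output (`-outfmt 6`, whose columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), the snippet below flags singletons that still have a convincing hit elsewhere. The thresholds (`max_evalue`, `min_identity`), the function name and the toy IDs are arbitrary assumptions for illustration, not recommended values.

```python
def singletons_with_hits(blast_lines, singleton_ids,
                         max_evalue=1e-5, min_identity=30.0):
    """Map each singleton to the other proteins it hits in -outfmt 6 output."""
    wanted = set(singleton_ids)
    hits = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        identity, evalue = float(cols[2]), float(cols[10])
        # Ignore queries we did not ask about and trivial self-hits.
        if query not in wanted or query == subject:
            continue
        if evalue <= max_evalue and identity >= min_identity:
            hits.setdefault(query, set()).add(subject)
    return hits


# Toy tabular output with hypothetical IDs.
blast_lines = [
    "S1\tS1\t100.0\t200\t0\t0\t1\t200\t1\t200\t1e-120\t400",
    "S1\tgenomeB_p7\t85.0\t195\t29\t0\t1\t195\t3\t197\t2e-90\t310",
    "S2\tgenomeC_p2\t25.0\t60\t45\t0\t5\t64\t10\t69\t0.5\t22",
]
candidates = ["S1", "S2", "S3"]

hits = singletons_with_hits(blast_lines, candidates)
still_unique = sorted(set(candidates) - set(hits))
print(hits)
print(still_unique)
```

Singletons that end up in `hits` may have been missed by the clustering step and deserve a second look; the rest remain candidates for truly unique proteins.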

written 4.1 years ago by h.mon (26k)

My dataset includes all the proteins for a particular bacterial taxon, but only those that could be found in NCBI during a short period of time. They are the complete translated gene sets, so I will have to state exactly when I ran the program (OrthoMCL) and also state that I considered NCBI data only. Other databases may well be more complete and contain more sequenced and translated genomes, but that is not my concern. BLASTing the singletons against genomes from other taxa is not required; I need information about this particular taxon only. I hope this option in OrthoMCL works properly. I would like to check it with an independent program, but I don't know of one. My data do not include translated transcriptomes, so I am not worried about those difficulties.

written 4.1 years ago by natasha.sernova (3.5k)

It turns out this question has already been discussed in "Tool for finding unique sets of proteins", where some tools are even mentioned. I have to study it; it may help.

written 4.1 years ago by natasha.sernova (3.5k)

I'd recommend using Usearch:

usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta

written 4.1 years ago by steven (70)

Dear Steven, you probably meant this program: http://drive5.com/usearch/manual/

In this example, usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta, what are the input and output FASTA files? I have not found the unique-proteins option on their site yet, sorry.

I've found that the program may help to get rid of singletons.

USEARCH command for discarding singletons

usearch -sortbysize derep.fasta -output derep2.fasta -minsize 2

I'm afraid it won't help me. They provide hundreds of options, I have to study them. It's a great program! http://drive5.com/usearch/manual/all_opts.html
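For what it is worth, the inverse of that -minsize 2 filter can be sketched in a few lines. This assumes the FASTA headers carry USEARCH-style `;size=N;` abundance annotations, as dereplicated files do; the function names and the toy records are hypothetical. Note that a "singleton" here only means a sequence with no other copies merged into it, which is a much narrower notion than a protein without orthologs.

```python
import re

# USEARCH-style abundance annotation, e.g. ">protA;size=3;"
SIZE_RE = re.compile(r";size=(\d+);?")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_singletons(lines):
    """Keep only records annotated with size=1 (the opposite of -minsize 2)."""
    kept = []
    for header, seq in read_fasta(lines):
        match = SIZE_RE.search(header)
        if match and int(match.group(1)) == 1:
            kept.append((header, seq))
    return kept


# Toy input with made-up sequences.
fasta = [
    ">protA;size=3;", "MKTAYIAKQR",
    ">protB;size=1;", "MSLNPLLDGT",
    ">protC;size=1;", "MAHHHRRKLV", "GGST",
]
singletons = keep_singletons(fasta)
for header, seq in singletons:
    print(">" + header)
    print(seq)
```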

 

 

modified 4.1 years ago • written 4.1 years ago by natasha.sernova (3.5k)

Hi Natasha, sorry I wasn't very specific in my previous comment. 0.70 is the recommended id for proteins when clustering. Here is the man page for id: http://drive5.com/usearch/manual/opt_id.html

I think if the id is set to 1.0 (sequences must be 100% matching), the sequences can be clustered into a set of unique sequences, but I have not tested this myself. In general, clustering is used to eliminate redundant sequences, and I have used usearch to great success doing this. Therefore I *think* using an id of 1.0, it might be possible to extend usearch to perform the unique clustering function you desire.

For more information on clustering take a look at the wiki page: https://en.wikipedia.org/wiki/Sequence_clustering

modified 4.1 years ago • written 4.1 years ago by steven (70)

Dear Steven, thank you very much! But it seems to me that this approach implies that I already know my set of unique proteins and compare them with the database of all the proteins I have for this taxon.

But let's suppose I don't have any known proteins at all. How would it be best to start? The program definitely knows how to search for unique proteins, since there is a tool to get rid of them. It's a mystery...

written 4.1 years ago by natasha.sernova (3.5k)

Hi Natasha, the nice thing about usearch is that it compares each sequence to the other sequences in the file - so you don't need a reference file with unique proteins. So maybe try:

usearch --cluster_fast input.fasta --id 1.0 --centroids output.fasta

where input.fasta is the database of proteins you have from the taxon, and output.fasta is the file where any unique sequences will be output after clustering. After running usearch, you can use grep -c ">" input.fasta and then grep -c ">" output.fasta to see if the total number of sequences decreased. Again, I haven't tested usearch to find unique sequences but it might be worth a try!
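One caveat worth keeping in mind: a centroids file holds one representative per cluster, so it mixes true singletons with representatives of multi-member clusters, and comparing the two grep counts only shows how much redundancy was removed. To isolate the clusters of size 1, a cluster table is needed. The sketch below assumes the tab-separated UCLUST (.uc) layout that usearch can write with its -uc option (not used in the commands above): record type in column 1, cluster number in column 2, cluster size in column 3 of "C" records, and the sequence label in column 9. The function name and toy labels are hypothetical.

```python
def singleton_clusters(uc_lines):
    """Return labels of centroids whose cluster has exactly one member."""
    centroid_label = {}  # cluster number -> centroid label (from 'S' records)
    cluster_size = {}    # cluster number -> size (from 'C' records)
    for line in uc_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        record_type, cluster = cols[0], cols[1]
        if record_type == "S":
            centroid_label[cluster] = cols[8]
        elif record_type == "C":
            cluster_size[cluster] = int(cols[2])
    return sorted(
        label
        for cluster, label in centroid_label.items()
        if cluster_size.get(cluster) == 1
    )


# Toy .uc content with hypothetical labels: protB joins protA's cluster,
# protC stays alone.
uc_lines = [
    "\t".join(["S", "0", "250", "*", "*", "*", "*", "*", "protA", "*"]),
    "\t".join(["H", "0", "248", "98.0", "+", "0", "0", "248M", "protB", "protA"]),
    "\t".join(["S", "1", "310", "*", "*", "*", "*", "*", "protC", "*"]),
    "\t".join(["C", "0", "2", "*", "*", "*", "*", "*", "protA", "*"]),
    "\t".join(["C", "1", "1", "*", "*", "*", "*", "*", "protC", "*"]),
]
unique = singleton_clusters(uc_lines)
print(unique)
```

Whether a size-1 cluster at a given identity threshold really corresponds to a protein without orthologs is a separate question, so treat the result as a candidate list.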

modified 4.1 years ago • written 4.1 years ago by steven (70)

Dear Steven, it is definitely worth a try, moreover I don't see any nice alternative way. Many thanks! I've just learnt about this program, who knows?

written 4.1 years ago by natasha.sernova (3.5k)
h.mon (26k), Brazil, wrote 4.1 years ago:

In addition to OrthoMCL, Proteinortho has a command-line option to output "singles" clusters. It also has an option to include synteny on cluster predictions, which may interest you.

Regarding steven's suggestion of using usearch, I do remember seeing (either in the uclust / usearch manual, or in the cd-hit manual) a quick-and-dirty ortholog prediction method based on progressively clustering with less stringent similarity, but I can't find it again. Finally, there is an alternative to usearch, vsearch, which aims to be a faster, open-source drop-in replacement for usearch.

written 4.1 years ago by h.mon (26k)

Unfortunately, the link for Proteinortho is not reliable. I failed to find a better link to it.

http://www.bioinf.uni-leipzig.de/Software/proteinortho/

The site above seems to be valid.

Dear colleagues, THANK YOU VERY MUCH FOR YOUR HELP!

modified 4.1 years ago • written 4.1 years ago by natasha.sernova (3.5k)