Dear ALL,
I've used the OMA program to find a set of orthologous proteins in a bacterial taxon.
Unfortunately, this very nice program does not report a set of unique proteins.
I know that OrthoMCL and the OrthoDB tools can do this, but I was not very successful in
finding these proteins with them. Are there any other tools for finding the unique genes in a
bacterial taxon?
I need all proteins that do not have any orthologs in this taxon, that is, only unique proteins
(singletons). Would you be so kind as to give me some hints?
Thank you very much!
Natasha
How did you run OrthoMCL? What do you mean by "I was not very successful"? Did you get any results, or did you not even manage to get it running? What kind of data do you have? Predicted peptides from genomes, proteins downloaded from nr, etc.?
In my experience, OrthoMCL outputs all the clusters it finds, including clusters with only one gene and one taxon.
Sorry, I was not quite correct. I did get a list of singletons from OrthoMCL. I would like to make sure that this list of proteins is correct and complete, and that other programs give the same set. I took all proteins for this bacterial taxon from NCBI, using the faa datasets for each bacterium in the taxon. I discarded only proteins shorter than 50 aa without annotation. I considered all proteins longer than 100 aa regardless of their annotation.
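One way to check such a singleton list independently is to recompute it as a set difference: every protein in the input FASTA that is not a member of any multi-protein group. The sketch below is not part of OrthoMCL. It assumes a groups file in the layout "group_id: taxon|protein taxon|protein ..." and FASTA headers whose first word is the same taxon|protein ID. The function name, file names, and IDs are made up for illustration.

```shell
#!/bin/sh
# Print every protein ID that occurs in the FASTA file but in no group
# with two or more members, one ID per line, sorted.
# Usage: list_singletons proteins.fasta groups.txt   (hypothetical names)
list_singletons() {
    fasta=$1
    groups=$2
    awk '
        # Groups file: field 1 is "group_id:", the rest are members.
        # Only groups with at least two members count as "grouped".
        FILENAME == ARGV[1] {
            if (NF > 2) for (i = 2; i <= NF; i++) grouped[$i] = 1
            next
        }
        # FASTA file: the ID is the first word of the header, without ">".
        /^>/ {
            id = substr($1, 2)
            if (!(id in grouped)) print id
        }
    ' "$groups" "$fasta" | LC_ALL=C sort
}

# Tiny demo with made-up IDs.
tmp=$(mktemp -d)
printf '>tax1|p1\nMKV\n>tax1|p2\nMKL\n>tax2|p3\nMRT\n>tax2|p4\nMAA\n' > "$tmp/proteins.fasta"
printf 'grp1: tax1|p1 tax2|p3\ngrp2: tax1|p2\n' > "$tmp/groups.txt"
list_singletons "$tmp/proteins.fasta" "$tmp/groups.txt"
rm -r "$tmp"
```

Sorting the output makes it easy to compare against another sorted ID list, for example with comm or diff.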
2015-06-27 19:00 GMT+03:00 h.mon on Biostar <notifications@biostars.org>:
If your datasets are the complete translated gene sets, then the OrthoMCL singletons are a good place to start. You could blast the singletons against the other genomes to see whether they are present there and were simply not found.
If your data include translated transcriptomes, I am less sure, because transcriptomes often lack genes that are not expressed in a particular tissue or developmental stage, or that were not sampled because of low expression.
My dataset includes all the proteins for a particular bacterial taxon, but only those that were available in NCBI at the time I downloaded them. They are the complete translated gene sets, so I will have to state exactly when I ran the program (OrthoMCL) and that I considered NCBI data only. Other databases may well be more complete and contain more sequenced and translated genomes, but that is not my concern. Blasting the singletons against genomes from other taxa is not required; I need the information about this particular taxon only. I hope the option in OrthoMCL works properly. I would like to check it with an independent program, but I don't know of one. My data do not include translated transcriptomes, so I am not worried about those difficulties.
2015-06-28 1:09 GMT+03:00 h.mon on Biostar <notifications@biostars.org>:
It turns out the question has already been discussed: "Tool for finding unique sets of proteins". Some tools are even mentioned there. I have to study this; it may help.
I'd recommend using Usearch:
usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta
Dear Steven, you probably meant this program: http://drive5.com/usearch/manual/
In this example, usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta, what are the input and output FASTA files? I have not found a unique-proteins option on their site yet, sorry.
2015-06-30 0:30 GMT+03:00 steven on Biostar <notifications@biostars.org>:
I've found that the program may help to get rid of singletons.
USEARCH command for discarding singletons
usearch -sortbysize derep.fasta -output derep2.fasta -minsize 2
I'm afraid it won't help me. They provide hundreds of options, I have to study them. It's a great program! http://drive5.com/usearch/manual/all_opts.html
Hi Natasha, sorry I wasn't very specific in my previous comment. 0.70 is the recommended id for proteins when clustering. Here is the man page for id: http://drive5.com/usearch/manual/opt_id.html
I think if the id is set to 1.0 (sequences must be 100% matching), the sequences can be clustered into a set of unique sequences, but I have not tested this myself. In general, clustering is used to eliminate redundant sequences, and I have used usearch to great success doing this. Therefore I *think* using an id of 1.0, it might be possible to extend usearch to perform the unique clustering function you desire.
For more information on clustering take a look at the wiki page: https://en.wikipedia.org/wiki/Sequence_clustering
Dear Steven, thank you very much! But it seems to me that this approach implies that I already know my set of unique proteins and compare them with the database of all the proteins I have in this taxon.
But let's suppose I don't have any known proteins at all. What would be the best way to start? The program definitely knows how to find the unique proteins, since there is a tool to get rid of them. It's a mystery...
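The inverse of the -minsize 2 filter quoted above can be sketched with plain awk: keep only the records whose cluster size is exactly 1. This assumes the FASTA headers already carry a ";size=N;" annotation of the kind usearch writes when size annotations are enabled. The function name is hypothetical. Note that size here counts sequences in the same dereplication or clustering group, which is not the same thing as having no ortholog.

```shell
#!/bin/sh
# Print only FASTA records whose header is annotated with size=1.
# Assumes headers such as ">label;size=1;" (size annotation present).
keep_size_one() {
    awk '
        # On each header, decide whether the whole record is kept.
        /^>/ { keep = ($0 ~ /;size=1(;|$)/) }
        # Print header and sequence lines of kept records.
        keep
    ' "$1"
}

# Tiny demo with made-up labels and sequences.
demo=$(mktemp)
printf '>a;size=3;\nMKV\n>b;size=1;\nMRT\n>c;size=12;\nMAA\n' > "$demo"
keep_size_one "$demo"
rm "$demo"
```

In the demo only record b is printed; size=12 is not mistaken for size=1 because the pattern requires a semicolon or the end of the line right after the number.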
2015-06-30 18:39 GMT+03:00 steven on Biostar <notifications@biostars.org>:
Hi Natasha, the nice thing about usearch is that it compares each sequence to the other sequences in the file - so you don't need a reference file with unique proteins. So maybe try:
usearch --cluster_fast input.fasta --id 1.0 --centroids output.fasta
where input.fasta is the database of proteins you have from the taxon, and output.fasta is the file where any unique sequences will be output after clustering. After running usearch, you can use
grep -c ">" input.fasta
and then
grep -c ">" output.fasta
to see if the total number of sequences decreased. Again, I haven't tested usearch to find unique sequences, but it might be worth a try!
Dear Steven, it is definitely worth a try; moreover, I don't see any nice alternative. Many thanks! I've only just learnt about this program, so who knows?
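As a rough cross-check of the usearch run at --id 1.0 and the two grep counts, identical sequences can also be collapsed with awk alone. This is a minimal sketch, not a replacement for usearch: it merges only full-length exact duplicates (case-insensitive), so its count may differ from what usearch reports. The function name and demo data are made up.

```shell
#!/bin/sh
# Print one FASTA record per distinct sequence; the first header seen for
# a sequence is kept. Sequences are written unwrapped on a single line.
unique_seqs() {
    awk '
        # Print the buffered record if its sequence has not been seen yet.
        function emit() {
            if (header != "" && !(seq in seen)) {
                seen[seq] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        # Join wrapped sequence lines, dropping whitespace and case.
        { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
        END { emit() }
    ' "$1"
}

# Tiny demo: records a and b carry the same sequence, wrapped differently.
demo=$(mktemp)
printf '>a\nMKV\nLLA\n>b\nMKVLLA\n>c\nMRT\n' > "$demo"
grep -c ">" "$demo"
unique_seqs "$demo" | grep -c ">"
rm "$demo"
```

If the second count is lower than the first, the input contains exact duplicates. Sequences that survive this step are non-redundant, which is a weaker property than having no ortholog in the taxon.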