Question: Generating A Non-Redundant Gene Set
4
gravatar for toshnam
7.4 years ago by
toshnam620
Seoul, Republic of Korea
toshnam620 wrote:

Hi all,

I have a plant genome assembly and I did a gene prediction using AUGUSTUS. This plant is a triploid, so AUGUSTUS predicted too many genes (113,195 genes) than my expectation. I believe this gene prediction result contains many redundancy. How can I make a non-redundant gene set?

Thanks.

• 3.8k views
ADD COMMENTlink modified 7.4 years ago by SimonCB765150 • written 7.4 years ago by toshnam620
6
gravatar for SimonCB765
7.4 years ago by
SimonCB765150
SimonCB765150 wrote:

You could also use UCLUST http://www.drive5.com/usearch/, which functions in a similar way to CD-HIT. I've had better experiences with UCLUST than CD-HIT in the past.

An alternative is to run you sequences through BLAST. You can then define a similarity threshold (10%, 20%, etc.), and remove sequences from your dataset until no two sequences have a pairwise identity greater than your chosen threshold. BLAST would be time consuming to run on such a large dataset, but if you have the time to spare/sufficient computing power it will enable you to generate a dataset with less redundancy than either UCLUST or CD-HIT. I can explain further if you need more information on how to perform this.

UPDATE

As requested some more information on the BLAST based approach. The way I've done this in the past is to make a BLAST database of all the genes, and BLAST each individual sequence against the database. In your case this would give 113,195 BLAST output files. I then parsed these files, and generated one big csv file where each line was like: Gene A,Gene B,Similarity

If you know Perl, then you can have a look at the source code for PISCES http://dunbrack.fccc.edu/Guoli/pisces_download.php. It is designed for proteins, but with a few modifications it would work for genes. In order to remove the redundancy as PISCES does, you would also also need the length of the sequence of each protein/gene. You then make a list of all your genes, with the gene with the longest sequence at the top of the list.

Lets assume you want no two genes to have a sequence identity of greater than 20%, i.e. no two genes are more than 20% similar to each other. Start at the top of your ordered list, and mark the gene at the top of the list as being included in your non-redundant set. Look in your similarity file you made earlier, and any genes in that that have >20% similarity with the gene at the top of the list are marked as not being in the non-redundant set. Move down your ordered list, until you find the next gene that is not marked as being in or out of the non-redundant set. Mark this gene as being in the non-redundant set, and all genes it is >20% similar to as being not in the non-redundant set. Keep moving down the list in this manner, until you reach the end of the list. You now have a set of genes marked as being in a non-redundant set.

There is a slightly better method called BlastCuller (paper). It was designed with proteins in mind, but is really just a generic algorithm for problems like this. The algorithm in the paper is fairly simple to implement, but a bit harder to describe here.

If you're confident in your coding skills, I would have a go at writing your code so that it uses the BlastCuller algorithm. Otherwise you can implement the PISCES algorithm yourself, or alter some of their Perl code. I believe you would just have to change the BLAST program you use, they use PSI-BLAST, and change the code that parses the BLAST output.

Hope this helps.

Simon

ADD COMMENTlink modified 7.4 years ago by Michael Kuhn5.0k • written 7.4 years ago by SimonCB765150

Thank you for your reply. I want to know more detail information for using BLAST.

ADD REPLYlink written 7.4 years ago by toshnam620
4
gravatar for Michael Kuhn
7.4 years ago by
Michael Kuhn5.0k
EMBL Heidelberg
Michael Kuhn5.0k wrote:

A simplistic solution is to cluster sequences with CD-HIT.

ADD COMMENTlink written 7.4 years ago by Michael Kuhn5.0k

Just keep in mind that this will also cluster paralogs and repeats into a single representative.

ADD REPLYlink written 7.4 years ago by Torst900

Thanks to all. :-)

ADD REPLYlink written 7.4 years ago by toshnam620
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 678 users visited in the last hour