Question

Generating A Non-Redundant Gene Set

4

Entering edit mode

12.4 years ago

toshnam ▴ 650

Hi all,

I have a plant genome assembly and I did a gene prediction using AUGUSTUS. This plant is a triploid, so AUGUSTUS predicted too many genes (113,195 genes) than my expectation. I believe this gene prediction result contains many redundancy. How can I make a non-redundant gene set?

Thanks.

• 6.4k views

ADD COMMENT • link updated 12.4 years ago by SimonCB765 ▴ 150 • written 12.4 years ago by toshnam ▴ 650

Michael Kuhn · Answer 1 · 2011-11-10

You could also use UCLUST http://www.drive5.com/usearch/, which functions in a similar way to CD-HIT. I've had better experiences with UCLUST than CD-HIT in the past.

An alternative is to run you sequences through BLAST. You can then define a similarity threshold (10%, 20%, etc.), and remove sequences from your dataset until no two sequences have a pairwise identity greater than your chosen threshold. BLAST would be time consuming to run on such a large dataset, but if you have the time to spare/sufficient computing power it will enable you to generate a dataset with less redundancy than either UCLUST or CD-HIT. I can explain further if you need more information on how to perform this.

UPDATE

As requested some more information on the BLAST based approach. The way I've done this in the past is to make a BLAST database of all the genes, and BLAST each individual sequence against the database. In your case this would give 113,195 BLAST output files. I then parsed these files, and generated one big csv file where each line was like: Gene A,Gene B,Similarity

If you know Perl, then you can have a look at the source code for PISCES http://dunbrack.fccc.edu/Guoli/pisces_download.php. It is designed for proteins, but with a few modifications it would work for genes. In order to remove the redundancy as PISCES does, you would also also need the length of the sequence of each protein/gene. You then make a list of all your genes, with the gene with the longest sequence at the top of the list.

Lets assume you want no two genes to have a sequence identity of greater than 20%, i.e. no two genes are more than 20% similar to each other. Start at the top of your ordered list, and mark the gene at the top of the list as being included in your non-redundant set. Look in your similarity file you made earlier, and any genes in that that have >20% similarity with the gene at the top of the list are marked as not being in the non-redundant set. Move down your ordered list, until you find the next gene that is not marked as being in or out of the non-redundant set. Mark this gene as being in the non-redundant set, and all genes it is >20% similar to as being not in the non-redundant set. Keep moving down the list in this manner, until you reach the end of the list. You now have a set of genes marked as being in a non-redundant set.

There is a slightly better method called BlastCuller (paper). It was designed with proteins in mind, but is really just a generic algorithm for problems like this. The algorithm in the paper is fairly simple to implement, but a bit harder to describe here.

If you're confident in your coding skills, I would have a go at writing your code so that it uses the BlastCuller algorithm. Otherwise you can implement the PISCES algorithm yourself, or alter some of their Perl code. I believe you would just have to change the BLAST program you use, they use PSI-BLAST, and change the code that parses the BLAST output.

Hope this helps.

Simon

score 4 · Answer 2 · 2011-11-10

4

Entering edit mode

12.4 years ago

Michael Kuhn 5.0k

A simplistic solution is to cluster sequences with CD-HIT.

ADD COMMENT • link 12.4 years ago by Michael Kuhn 5.0k

0

Entering edit mode

Just keep in mind that this will also cluster paralogs and repeats into a single representative.