Question: Extracting a set of complete genes from de novo assembled unannotated contigs
2.6 years ago by
Kevin D30
INRA France
Kevin D30 wrote:

Hello everyone,

Let's say I have a data set of 1000 genes gathered in a FASTA file. Is there a way to BLAST them against de novo assembled, unannotated contigs and extract only those that are found complete (i.e. full length) and in two copies?

The problem with BLAST is that the hits I get often do not span the full length of the query gene, because of more variable regions. As a result, each subject gene is split across several hits. Since I want to know which of my 1000 genes are present in two copies in my contigs, I cannot check each of the 1000 BLAST tables by eye, summing the query cover for each subject gene ID to guess the number of copies. Besides, I want to extract those complete genes, and BLAST only returns a table of hits.
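To illustrate the summing step described above, here is a minimal sketch that merges the query intervals of all hits for each query/contig pair and reports the contigs on which a gene is covered almost end to end. It assumes the search was run with tabular output plus a query-length column (`-outfmt "6 std qlen"`); the function names and the 0.9 coverage threshold are hypothetical choices, not a recommendation from this thread.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total length covered by 1-based inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous run (if any) and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def full_length_copies(blast_lines, min_cover=0.9):
    """Map each query ID to the contigs whose hits jointly cover
    at least min_cover of the query length.

    Assumed columns: the 12 standard ones, then qlen as column 13.
    """
    hsps = defaultdict(list)
    qlens = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qid, sid = fields[0], fields[1]
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        qlens[qid] = int(fields[12])
        hsps[(qid, sid)].append((qstart, qend))

    copies = defaultdict(list)
    for (qid, sid), intervals in hsps.items():
        if merged_length(intervals) / qlens[qid] >= min_cover:
            copies[qid].append(sid)
    return dict(copies)


# Made-up example rows, for illustration only.
demo = [
    "geneA\tcontig1\t95.0\t400\t20\t0\t1\t400\t1000\t1399\t1e-100\t700\t1000",
    "geneA\tcontig1\t92.0\t550\t44\t0\t451\t1000\t1500\t2049\t1e-150\t900\t1000",
    "geneA\tcontig7\t97.0\t1000\t30\t0\t1\t1000\t5000\t4001\t0.0\t1700\t1000",
    "geneB\tcontig3\t90.0\t300\t30\t0\t1\t300\t10\t309\t1e-80\t500\t1200",
]
result = full_length_copies(demo)
two_copy_genes = sorted(q for q, contigs in result.items() if len(contigs) == 2)
print(two_copy_genes)
```

Note that this counts at most one copy per contig: two copies sitting on the same contig would be merged, and a gene split across two contigs would be missed, so the result is only a first filter.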

Thanks for your suggestions!

2.6 years ago by
Melbourne, Australia
Sentinel156120 wrote:

I'm assuming you are working on a diploid eukaryote? How can you possibly determine whether there are two copies of each gene from a de novo assembly? Won't reads from either chromosome simply be collapsed into a single contig with the variant sites annotated at best? What assembler did you use for this?

Secondly, can you use a liftover tool to annotate genes on your assembled contigs? I've used RATT in the past with success (there are probably better tools nowadays), but then I had access to a very well annotated reference genome.


Right, I'm working on a diploid plant. I assume there are two copies because it is a first-generation interspecific hybrid, so reads from either chromosome should not merge, as the hybrid's parents are slightly different. I did not perform the assembly myself, but I think Velvet was used to build the contigs.

I also thought I could annotate all the contigs and then look for my 1000 genes, but annotation jobs (e.g. EuGene, Augustus) take a lot of memory on the bioinfo cluster. Since I'm not interested in the expected 30,000 genes that make up the genome, I would prefer a tool that only looks for my 1000-gene set and extracts the two copies. Besides, the assembly may not be of good quality, so I don't think annotating it would be a good idea.
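As a rough sketch of the extraction step, the subject coordinates from a tabular BLAST hit can be used to cut the matching region out of the contig FASTA. The helper names below are hypothetical. The code relies on the convention that, for nucleotide searches, a subject start greater than the subject end marks a hit on the minus strand, and it handles only the plain A/C/G/T/N alphabet.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence ID: sequence string}."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the ID, as BLAST does.
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {seq_id: "".join(chunks) for seq_id, chunks in parts.items()}


# Complement table for unambiguous bases plus N (assumption: no IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(contigs, contig_id, sstart, send):
    """Return the contig region between 1-based inclusive sstart and send.

    If sstart > send the hit is on the minus strand, so the region is
    reverse-complemented to match the query orientation.
    """
    seq = contigs[contig_id]
    low, high = min(sstart, send), max(sstart, send)
    region = seq[low - 1:high]
    if sstart > send:
        region = region.translate(_COMPLEMENT)[::-1]
    return region


# Made-up contig, for illustration only.
contigs = read_fasta([">contig1 example", "AAACCC", "GGGTTT"])
plus_hit = extract_region(contigs, "contig1", 1, 6)
minus_hit = extract_region(contigs, "contig1", 6, 1)
print(plus_hit, minus_hit)
```

For a gene split into several hits on one contig, passing the smallest and largest subject coordinates of those hits would return the whole locus, including any variable stretch between them.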

written 2.6 years ago by Kevin D

