Question

Extracting a set of complete genes from de novo assembled unannotated contigs

0

7.2 years ago

Kevin D ▴ 30

Hello everyone,

Let's say I have a set of 1000 genes gathered in a FASTA file. Is there a way to BLAST them against de novo assembled, unannotated contigs and extract only those that are found complete (i.e. full length) and in two copies?

The problem with BLAST is that I only get hits, which often do not cover the full-length query gene because of its more variable parts. As a result, each subject gene is split across several hits. Since I want to see which of my 1000 genes are present in two copies in my contigs, I cannot check each of the 1000 BLAST tables by eye to sum the query coverage for each subject gene ID and guess the number of copies. Besides, I want to extract those complete genes, and BLAST only returns a table of hits.
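The by-eye check described above (summing query coverage per subject and counting copies) could be scripted. Here is a minimal sketch, not a tested pipeline. It assumes the hits come from BLAST+ tabular output with the query length appended as a 13th column (`-outfmt "6 std qlen"`), and that each copy of a gene sits on a different contig, so two copies on the same contig would be counted as one. The function names and the 0.95 coverage threshold are made up for illustration, and the order and strand of the hits are not checked.

```python
from collections import defaultdict


def merged_length(intervals):
    """Number of positions covered by 1-based inclusive (start, end) intervals,
    counting overlapping or adjacent intervals only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def copies_per_gene(blast_lines, min_cover=0.95):
    """Map each query gene to the contigs whose combined hits cover at least
    min_cover of the query length (assumed column layout: std + qlen)."""
    hits = defaultdict(list)  # (query, contig) -> list of query intervals
    qlens = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, contig = fields[0], fields[1]
        qstart, qend = sorted((int(fields[6]), int(fields[7])))
        qlens[query] = int(fields[12])  # qlen, the appended 13th column
        hits[(query, contig)].append((qstart, qend))

    result = defaultdict(list)
    for (query, contig), intervals in hits.items():
        if merged_length(intervals) / qlens[query] >= min_cover:
            result[query].append(contig)
    return dict(result)


def genes_with_two_copies(blast_lines, min_cover=0.95):
    """Keep only the genes found near full length on exactly two contigs."""
    per_gene = copies_per_gene(blast_lines, min_cover)
    return {gene: contigs for gene, contigs in per_gene.items() if len(contigs) == 2}
```

With a table produced by something like `blastn -query genes.fasta -subject contigs.fasta -outfmt "6 std qlen" > hits.tsv` (file names are placeholders), the result would come from `genes_with_two_copies(open("hits.tsv"))`.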

Thanks for your suggestions!

blast gene extraction complete genes unannotated • 2.0k views

updated 7.2 years ago by Sentinel156 ▴ 190 • written 7.2 years ago by Kevin D ▴ 30

score 0 · Answer 1 · 2017-01-31

0

7.2 years ago

Sentinel156 ▴ 190

I'm assuming you are working on a diploid eukaryote? How can you possibly determine whether there are two copies of each gene from a de novo assembly? Won't reads from either chromosome simply be collapsed into a single contig with the variant sites annotated at best? What assembler did you use for this?

Secondly, can you use a liftover tool to annotate genes on your assembled contigs? I've used RATT in the past with success (there are probably better tools nowadays), but then I had access to a very well annotated reference genome.

0

Right, I'm working on a diploid plant. I assume there are two copies because it is a first-generation interspecific hybrid, so reads from either chromosome should not merge, since the hybrid's parents differ slightly. I did not perform the assembly myself, but I think Velvet was used to build the contigs.

I also thought I could annotate all the contigs and then look for my 1000 genes, but annotation jobs (e.g. EuGene, Augustus, ...) take a lot of memory on the bioinfo cluster. Since I'm not interested in the expected 30,000 genes that make up the genome, I would prefer a tool that only looks for my 1000-gene set and extracts the two copies. Besides, the assembly may not be of good quality, so I don't think annotating it would be a good idea.
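For the extraction step, one option that avoids annotating the whole assembly is to cut the matched region out of the contig using the subject coordinates from the hit table. This is a rough sketch only. It assumes the contigs are in plain FASTA, that the subject ID in the BLAST table equals the first word of the FASTA header, and that minus-strand hits are reported with `sstart` greater than `send`, as blastn tabular output does. `read_fasta` and `extract_region` are illustrative names.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}, using the first
    whitespace-delimited word of each header as the ID."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


# Complement table for plain DNA; other IUPAC codes are left unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(contigs, contig, sstart, send):
    """Return the contig region between sstart and send (1-based, inclusive).
    If sstart > send the hit is on the minus strand, so the region is
    reverse-complemented to match the query orientation."""
    low, high = sorted((sstart, send))
    seq = contigs[contig][low - 1:high]
    if sstart > send:
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq
```

When a gene is split across several hits on one contig, passing the smallest and largest subject coordinates of those hits would return the whole span, including the variable parts that BLAST did not align.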

written 7.2 years ago by Kevin D ▴ 30