Extracting a set of complete genes from de novo assembled unannotated contigs
7.2 years ago
Kevin D ▴ 30

Hello everyone,

Let's say I have a data set of 1000 genes gathered in a FASTA file. Is there a way to BLAST them against de novo assembled, unannotated contigs and extract only those that are found complete (i.e. full length) and in two copies?

The problem with BLAST is that I only get hits that often do not span the full-length query gene, because the more variable regions break the alignment. As a result, each subject gene is split across several hits. Since I want to see which of my 1000 genes are present in two copies in my contigs, I cannot check each of the 1000 BLAST tables by eye, summing the query cover for each subject gene ID and guessing the number of copies. Besides, I want to extract those complete genes, and BLAST only returns a table of hits.
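To illustrate the bookkeeping described above, here is a minimal Python sketch that merges the hits of each gene per contig, computes the fraction of the query covered, and keeps genes that are near-complete on exactly two contigs. It assumes the BLAST table was produced with `-outfmt "6 std qlen"` (the 12 standard columns plus query length), that each copy sits on a different contig (two copies on the same contig would be counted once), and that 90% coverage counts as "complete". The function names and the threshold are illustrative choices, not part of any existing tool.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total length covered by 1-based inclusive (start, end) intervals,
    counting overlapping or adjacent intervals only once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Close the previous merged block, if any, and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def query_coverage(blast_lines):
    """Return {(gene, contig): fraction of the gene covered by merged hits}.

    Assumed columns (tab-separated): qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore qlen
    """
    hsps = defaultdict(list)
    qlens = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        gene, contig = f[0], f[1]
        qstart, qend, qlen = int(f[6]), int(f[7]), int(f[12])
        hsps[(gene, contig)].append((min(qstart, qend), max(qstart, qend)))
        qlens[gene] = qlen
    return {key: merged_length(iv) / qlens[key[0]] for key, iv in hsps.items()}


def genes_with_n_copies(coverage, min_cov=0.9, copies=2):
    """Genes covered at >= min_cov on exactly `copies` distinct contigs."""
    per_gene = defaultdict(list)
    for (gene, contig), cov in coverage.items():
        if cov >= min_cov:
            per_gene[gene].append(contig)
    return {g: sorted(c) for g, c in per_gene.items() if len(c) == copies}
```

In practice one would pass an open file handle, e.g. `genes_with_n_copies(query_coverage(open("hits.tsv")))`, where `hits.tsv` is a placeholder file name. Merging intervals before summing matters because overlapping hits would otherwise push the apparent coverage above 100%.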

Thanks for your suggestions!

blast gene extraction complete genes unannotated • 2.0k views
7.2 years ago
Sentinel156 ▴ 190

I'm assuming you are working on a diploid eukaryote? How can you possibly determine whether there are two copies of each gene from a de novo assembly? Won't reads from either chromosome simply be collapsed into a single contig with the variant sites annotated at best? What assembler did you use for this?

Secondly, can you use a liftover tool to annotate genes on your assembled contigs? I've used RATT in the past with success (there are probably better tools nowadays), but then I had access to a very well annotated reference genome.


Right, I'm working on a diploid plant. I assume there are two copies because it is a first-generation interspecific hybrid, so reads from either chromosome should not merge, as the hybrid's parents are slightly different. I did not perform the assembly myself, but I think Velvet was used to build the contigs.

I also thought I could annotate all the contigs and then look for my 1000 genes, but annotation jobs (e.g. EuGene, Augustus) take a lot of memory on the bioinfo cluster. Since I'm not interested in the roughly 30,000 genes expected in the genome, I would prefer a tool that only looks for my 1000-gene set and extracts the two copies. Besides, the assembly may not be of good quality, so I don't think annotating it would be a good idea.
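As a rough illustration of the extraction step without a full annotation run, the sketch below cuts out the contig region spanned by all hits of one gene on one contig, using the subject coordinates (`sstart`, `send`) from a blastn tabular report. It assumes nucleotide-vs-nucleotide hits, where minus-strand hits are reported with `sstart > send`, and it simply takes the span from the lowest to the highest subject coordinate, so distant spurious hits would inflate the region. All function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {sequence_id: sequence}; the ID is the first
    whitespace-delimited word of the header."""
    seqs, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None and line:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_locus(contig_seq, hsps):
    """Return the contig region spanned by all hits of one gene.

    hsps: list of (sstart, send) pairs, 1-based inclusive. The region is
    reverse-complemented when most hits are on the minus strand, so the
    output is in the same orientation as the query gene.
    """
    lo = min(min(s, e) for s, e in hsps)
    hi = max(max(s, e) for s, e in hsps)
    region = contig_seq[lo - 1:hi]
    minus_hits = sum(1 for s, e in hsps if s > e)
    return reverse_complement(region) if minus_hits > len(hsps) / 2 else region
```

Because the span between hits is kept, the poorly aligned variable parts that BLAST skipped are still included in the extracted sequence, which is the point when the goal is a full-length copy rather than a set of fragments.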

