Detecting gene duplication in de-novo assembly
7.9 years ago
joneill4x ▴ 160

I have recently assembled a genome de novo. I am looking for fertility genes that are present as a single copy in the true genome. I am worried that if the true genome contains multiple copies of a gene, they will collapse into a single gene in the de novo assembly. Is this a reasonable concern? If so, how would I detect it?

Thank you,

Joe

Edit - I do not have multiple samples, which many CNV detection tools require. I have only the reads (MiSeq, HiSeq, and PacBio) and a reference assembly.

genome sequencing next-gen gene • 2.5k views
7.9 years ago
igor 13k

A lot depends on your read length and library type.

I assume you are working with short reads, though. Once you have your assembled genome, you can align your reads back to it and do copy number analysis that way. Basically, read depth over a gene should scale with its copy number. For example, if two copies of a gene were collapsed into one in the assembly, that gene should show roughly twice the read depth of the single-copy regions around it.
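To make the depth argument concrete, here is a minimal sketch, assuming you already have per-base depth values for a gene and for the assembly as a whole (for example from aligning the reads back to the assembly and running something like `samtools depth`). The function name and the toy numbers are made up for illustration; they are not from any particular CNV tool.

```python
from statistics import median


def estimate_copy_number(gene_depths, genome_depths, baseline_copies=1):
    """Estimate how many true copies sit behind one assembled gene.

    gene_depths:     per-base read depth across the gene (hypothetical input)
    genome_depths:   per-base read depth sampled across the assembly
    baseline_copies: copy number assumed for a region at typical depth

    Medians are used so that a few extreme positions do not dominate.
    """
    genome_median = median(genome_depths)
    if genome_median <= 0:
        raise ValueError("genome-wide median depth must be positive")
    return baseline_copies * median(gene_depths) / genome_median


# Toy values only: a gene at about twice the typical depth.
genome = [60, 62, 65, 66, 68, 64, 65]
gene = [128, 130, 131, 132, 129]
print(estimate_copy_number(gene, genome))
```

A ratio near 1 is consistent with a single copy, while a ratio near 2 suggests two copies collapsed into one. Real data is noisy, so treat the ratio as a hint to follow up on rather than a call.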

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV


Depending on your organism/genome of interest it might be beneficial to work with long reads, such as Oxford Nanopore or PacBio.


Thanks Igor. I noticed that a lot of those tools require multiple samples in order to detect CNV. In my case, I only have reads (PE MiSeq, PE HiSeq, and PacBio) and a de novo assembly. So my problem seems a little different from typical CNV detection.


Could you expand a bit on which organism/genome you are working with? If that is confidential, perhaps just the size and ploidy will suffice. What coverage do you have with PacBio?


Diploid genome, heterozygosity rate 0.01 - 0.02. Estimated genome size 700,000,000 bases. ~15X PacBio and ~65X Illumina coverage.


Sounds pretty decent to me, though I obviously can't judge the quality of your assembly. You could check whether the coverage of the Illumina and PacBio reads (each analyzed separately) is evenly distributed over your genome after normalizing for GC content. Regions with clearly elevated coverage would point to collapsed repetitive elements.
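A rough sketch of that check, assuming the assembly has been cut into fixed-size windows and each window comes with its sequence and mean read depth (computed per read set, once for Illumina and once for PacBio). All function names, the number of GC bins, and the 1.5 cutoff are illustrative assumptions, not settings from any published tool.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases, or None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return (seq.count("G") + seq.count("C")) / acgt


def gc_normalized_ratios(windows, n_bins=20):
    """Divide each window's depth by the median depth of its GC bin.

    windows: list of (sequence, mean_depth) tuples (hypothetical input).
    Returns one ratio per window; None where no ratio can be computed.
    """
    keys = []
    depths_by_bin = defaultdict(list)
    for seq, depth in windows:
        gc = gc_fraction(seq)
        key = None if gc is None else min(int(gc * n_bins), n_bins - 1)
        keys.append(key)
        if key is not None:
            depths_by_bin[key].append(depth)
    bin_median = {k: median(v) for k, v in depths_by_bin.items()}
    ratios = []
    for (_, depth), key in zip(windows, keys):
        m = bin_median.get(key)
        ratios.append(depth / m if m else None)
    return ratios


def flag_elevated(ratios, threshold=1.5):
    """Indices of windows whose normalized depth is at or above the cutoff."""
    return [i for i, r in enumerate(ratios) if r is not None and r >= threshold]


# Toy windows: five at 50% GC (one with doubled depth), three AT-rich ones.
toy = [("ACGT" * 5, d) for d in (60, 62, 64, 130, 61)]
toy += [("AAAT" * 5, 30)] * 3
print(flag_elevated(gc_normalized_ratios(toy)))
```

In the toy data the AT-rich windows have low raw depth but are not treated as unusual, because they are compared only with windows of similar GC content. Windows flagged in both the Illumina and the PacBio runs would be the most convincing candidates for collapsed sequence.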


OK! I'll give that a try. Thanks Wouter!


RDXplorer looks like a good one for me to try. Edit - only for human genome, looks like I can't use it.
