Question

Detecting gene duplication in de-novo assembly

0

7.9 years ago

joneill4x ▴ 160

I have recently assembled a genome de novo. I am looking for fertility genes that are present as a single copy in the true genome. I am worried that if the true genome contains multiple copies of a gene, they will be collapsed into a single gene in the de novo assembly. Is this a reasonable concern? If so, how would I detect this?

Thank you,

Joe

Edit - I do not have multiple samples, as many CNV detection tools require. I have only the reads (MiSeq, HiSeq, and PacBio) and a reference assembly.

genome sequencing next-gen gene • 2.5k views

ADD COMMENT • link 7.9 years ago by joneill4x ▴ 160

score 1 · Answer 1 · 2016-05-30

1

7.9 years ago

igor 13k

A lot depends on your read length and library type.

I assume you are working with short reads, though. Once you have your assembled genome, you can align your reads back to it and do copy number analysis that way. Basically, the read depth over a gene should correlate with its copy number. For example, if two copies of a gene were collapsed into one in the assembly, that gene should have roughly twice the read depth of a single-copy region.
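The read-depth idea above can be sketched in a few lines of Python. This is only an illustration, not a validated CNV caller. It assumes per-base depth is available as tab-separated `contig`, `position`, `depth` lines (the layout that `samtools depth` writes), that gene coordinates are known on the assembly, and that most genes are single copy, so the median gene depth is a usable baseline. The function names and the toy data are hypothetical.

```python
from statistics import median


def parse_depth(lines):
    """Parse 'contig<TAB>pos<TAB>depth' lines into {contig: {pos: depth}}."""
    table = {}
    for line in lines:
        contig, pos, depth = line.rstrip("\n").split("\t")
        table.setdefault(contig, {})[int(pos)] = int(depth)
    return table


def mean_depth(depth_by_pos, start, end):
    """Mean depth over the 1-based inclusive interval [start, end].

    Positions missing from the table are counted as depth 0.
    """
    total = sum(depth_by_pos.get(pos, 0) for pos in range(start, end + 1))
    return total / (end - start + 1)


def relative_depth(table, genes):
    """Return {gene: mean depth / median of all gene depths}.

    Assumption: most genes are single copy, so the median is the baseline.
    """
    depths = {
        name: mean_depth(table.get(contig, {}), start, end)
        for name, (contig, start, end) in genes.items()
    }
    baseline = median(depths.values())
    return {name: depth / baseline for name, depth in depths.items()}


# Toy data: geneB sits in a region with twice the depth of the other genes.
toy_lines = (
    [f"contig1\t{p}\t30" for p in range(1, 11)]
    + [f"contig1\t{p}\t60" for p in range(11, 21)]
    + [f"contig2\t{p}\t30" for p in range(1, 11)]
)
toy_genes = {
    "geneA": ("contig1", 1, 10),
    "geneB": ("contig1", 11, 20),
    "geneC": ("contig2", 1, 10),
}
ratios = relative_depth(parse_depth(toy_lines), toy_genes)
print(ratios)  # {'geneA': 1.0, 'geneB': 2.0, 'geneC': 1.0}
```

A ratio near 2 for a gene would be consistent with two copies collapsed into one in the assembly. Real data is noisier than this toy case, so any cutoff would need to be chosen with the spread of the single-copy genes in mind.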

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV

ADD COMMENT • link 7.9 years ago by igor 13k

0

Depending on your organism/genome of interest it might be beneficial to work with long reads, such as Oxford Nanopore or PacBio.

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Thanks Igor. I noticed that a lot of those tools require multiple samples in order to detect CNV. In my case, I only have reads (PE MiSeq, PE HiSeq, and PacBio) and a de-novo assembly. So my problem seems a little different from typical CNV detection.

ADD REPLY • link 7.9 years ago by joneill4x ▴ 160

1

Could you expand a bit on which organism/genome you are working with? If that is confidential, perhaps just the size and ploidy will suffice. What coverage do you have with PacBio?

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

Diploid genome, heterozygosity rate 0.01-0.02. Estimated genome size 700 000 000 bases. ~15X PacBio and ~65X Illumina coverage.

ADD REPLY • link 7.9 years ago by joneill4x ▴ 160

1

Sounds pretty decent to me, though obviously I can't judge the quality of your assembly. You could check whether the coverage of the Illumina and PacBio reads (each analyzed separately), normalized for GC content, is evenly distributed over your genome. Regions with excess coverage would point to collapsed repetitive elements.
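A minimal sketch of that GC-normalized coverage check, in Python. It assumes the assembly has already been cut into windows and that a mean depth per window has been computed from the aligned reads. Each window's depth is divided by the median depth of all windows with similar GC content, and windows well above 1 are flagged. The function names, the 5% GC bin width, the 1.5 threshold and the toy windows are illustrative assumptions, not values taken from this thread.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among called bases (A, C, G, T); 0.0 if none are called."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / called if called else 0.0


def gc_normalized_depth(windows, bin_width=0.05):
    """windows: iterable of (window_id, sequence, mean_depth).

    Returns {window_id: depth / median depth of windows in the same GC bin}.
    """
    depths_by_bin = defaultdict(list)
    keyed = []
    for window_id, seq, depth in windows:
        gc_bin = int(gc_fraction(seq) / bin_width)
        depths_by_bin[gc_bin].append(depth)
        keyed.append((window_id, gc_bin, depth))
    medians = {gc_bin: median(d) for gc_bin, d in depths_by_bin.items()}
    return {
        window_id: (depth / medians[gc_bin] if medians[gc_bin] else 0.0)
        for window_id, gc_bin, depth in keyed
    }


def flag_collapsed(normalized, threshold=1.5):
    """Window ids whose normalized depth is at or above the threshold."""
    return sorted(w for w, ratio in normalized.items() if ratio >= threshold)


# Toy windows: w3 has about twice the depth of other windows with the same GC.
toy_windows = [
    ("w1", "ATGC" * 5, 60),
    ("w2", "ATGC" * 5, 62),
    ("w3", "ATGC" * 5, 125),
    ("w4", "AATT" * 5, 40),
    ("w5", "AATT" * 5, 42),
    ("w6", "AATT" * 5, 41),
]
normalized = gc_normalized_depth(toy_windows)
print(flag_collapsed(normalized))  # ['w3']
```

Running this once on the Illumina depths and once on the PacBio depths keeps the two technologies separate, as suggested. Windows flagged by both would be stronger candidates for collapsed repeats than windows flagged by only one.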

ADD REPLY • link 7.9 years ago by WouterDeCoster 47k

0

OK! I'll give that a try. Thanks Wouter!

ADD REPLY • link 7.9 years ago by joneill4x ▴ 160

0

RDXplorer looks like a good one for me to try. Edit - it only works for the human genome, so it looks like I can't use it.

ADD REPLY • link 7.9 years ago by joneill4x ▴ 160