Question: Detecting gene duplication in de-novo assembly
0
joneill4x60 (Canada) wrote 3.7 years ago:

I have recently assembled a genome de-novo. I am looking for fertility genes where there is only 1 copy of the gene in the true genome. I am worried that if there are multiple copies of a gene in the true genome, they will appear as a single gene in the de-novo assembly. Is this a reasonable concern? If so, how would I detect this?

Thank you,

Joe

Edit - I do not have multiple samples, as many CNV detection tools require. I have only the reads (MiSeq, HiSeq, and PacBio) and a reference assembly.

sequencing next-gen gene genome • 1.6k views
modified 3.7 years ago • written 3.7 years ago by joneill4x60
1
igor9.5k (United States) wrote 3.7 years ago:

A lot depends on your read length and library type.

I assume you are working with short reads, though. Once you have your assembled genome, you can align your reads back to it and do copy number analysis that way. Basically, the read depth over a gene should correlate with its copy number. For example, if two copies of a gene were collapsed into one in the assembly, that region should have roughly twice the average read depth.
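To make the depth-to-copy-number idea concrete, here is a minimal sketch in Python. It assumes you already have per-base depth as tab-separated `contig`, `position`, `depth` lines (the kind of table `samtools depth` produces after aligning reads back to the assembly). The function names and the gene interval are hypothetical, and this is only an illustration of the ratio, not a full CNV caller.

```python
from statistics import median


def parse_depth_lines(lines):
    """Parse 'contig<TAB>position<TAB>depth' lines into (contig, pos, depth) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        contig, pos, depth = line.split("\t")
        records.append((contig, int(pos), int(depth)))
    return records


def relative_copy_number(records, contig, start, end):
    """Median depth inside [start, end] on `contig`, divided by the
    median depth over all records (used here as the single-copy baseline)."""
    all_depths = [d for _, _, d in records]
    gene_depths = [d for c, p, d in records if c == contig and start <= p <= end]
    if not gene_depths:
        raise ValueError("no depth records in the requested interval")
    baseline = median(all_depths)
    if baseline == 0:
        raise ValueError("genome-wide median depth is zero")
    return median(gene_depths) / baseline
```

A ratio near 1 is what you would expect for a gene that is single-copy in the assembly and in the real genome. A ratio near 2 suggests two copies were collapsed into one. Intermediate values are harder to interpret in a heterozygous diploid, so treat this as a screen rather than a verdict.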

See this paper for a CNV analysis overview: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4394692/

Some previous discussion: Running 1.5M potentially different generalized linear models depending on distribution of read depth information to study CNV

modified 3.7 years ago • written 3.7 years ago by igor9.5k

Depending on your organism/genome of interest it might be beneficial to work with long reads, such as Oxford Nanopore or PacBio.

written 3.7 years ago by WouterDeCoster43k

Thanks Igor. I noticed that a lot of those tools require multiple samples in order to detect CNV. In my case, I only have reads (PE MiSeq, PE HiSeq, and PacBio) and a de-novo assembly. So my problem seems a little different from typical CNV detection.

modified 3.7 years ago • written 3.7 years ago by joneill4x60
1

Could you expand a bit on which organism/genome you are working with? If that is confidential, perhaps just the size and ploidy will suffice. What coverage do you have with PacBio?

written 3.7 years ago by WouterDeCoster43k

Diploid genome, heterozygosity rate 0.01 - 0.02. Estimated genome size 700,000,000 bases (~700 Mb). Coverage is ~15X PacBio and ~65X Illumina.

written 3.7 years ago by joneill4x60
1

Sounds pretty decent to me, though I obviously can't judge the quality of your assembly. You could check whether the coverage of the Illumina reads and of the PacBio reads (each analysed separately) is evenly distributed over your genome after normalizing for GC content. Regions with clearly elevated coverage may be collapsed repetitive elements.
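A rough sketch of the GC-normalized coverage check, in Python. It assumes you have already cut the assembly into fixed-size windows and computed a GC fraction and a mean read depth for each one. The function names, the 0.05 GC bin width, and the 1.5 ratio threshold are illustrative assumptions, not values from any particular tool. You would run it once for the Illumina depths and once for the PacBio depths.

```python
from collections import defaultdict
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among the A/C/G/T bases of a window (N bases ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc_normalized_ratios(windows, bin_width=0.05):
    """windows: list of (gc_fraction, mean_depth) pairs.
    Each depth is divided by the median depth of windows in the same GC bin."""
    bins = defaultdict(list)
    for gc, depth in windows:
        bins[int(gc / bin_width)].append(depth)
    bin_median = {b: median(depths) for b, depths in bins.items()}
    ratios = []
    for gc, depth in windows:
        m = bin_median[int(gc / bin_width)]
        ratios.append(depth / m if m > 0 else float("nan"))
    return ratios


def flag_collapsed(windows, threshold=1.5, bin_width=0.05):
    """Indices of windows whose GC-normalized depth ratio is >= threshold."""
    ratios = gc_normalized_ratios(windows, bin_width)
    return [i for i, r in enumerate(ratios) if r >= threshold]
```

Comparing each window only against windows of similar GC content keeps GC-driven coverage dips or peaks from being mistaken for copy-number changes. Windows flagged in both the Illumina and the PacBio runs are the more convincing candidates for collapsed regions.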

written 3.7 years ago by WouterDeCoster43k

OK! I'll give that a try. Thanks Wouter!

written 3.7 years ago by joneill4x60

RDXplorer looks like a good one for me to try. Edit - it appears to work only with the human genome, so it looks like I can't use it.

modified 3.7 years ago • written 3.7 years ago by joneill4x60