Question: Confirming Structural Variants In The Genome
8.3 years ago by
Rubal7770 wrote:

Hi Everyone,

I'd like to get some advice/ideas on how to confirm that there is structural variation in a specific genomic region in a population. We sequenced individuals from two populations to high coverage, ~50X worth of data for each population. We detect a region of the genome ~500kb in size where we see double the amount of coverage in one population compared to the other. We therefore suspect the region is duplicated in one population (we also see a large increase in heterozygosity in the population with the higher coverage, which is presumably from mapping two genomic regions to one location).

We'd like to confirm that there is indeed a duplication here and that there is not just a random increase in coverage in one population at this site. One approach we are considering is searching for 'junction fragments', the reads that contain part of both the original sequence and the new duplicated sequence. Presumably these will not have been mapped, as they have no close match in the reference genome. If anybody knows of a good way to do this, or knows of any papers or software that deal with this problem, that would be great.

Any other ideas for confirming the presence of copy number variation are appreciated, ideally methods that we could use on the existing data rather than resequencing.

*I should specify that we would also like to know exactly where the structural variant begins and ends.

Many thanks in advance

genome coverage • 2.0k views
modified 2.2 years ago by Biostar ♦♦ 20 • written 8.3 years ago by Rubal7770

What species? Are the "individuals" expected to be homogeneous, genetically?

written 8.3 years ago by Sean Davis26k
8.3 years ago by
Rochester, NY USA
Alex Paciorkowski3.4k wrote:

I would use a second non-informatics technique to confirm this. You should be able to design 4-5 primers at various spots within and flanking your 500 kb duplication, and perform quantitative PCR. You don't say what n you are working with in your populations, but qPCR is a relatively efficient way of detecting copy number in large populations (96-well plates at a time). The biggest problem is running enough controls; we usually use 3 housekeeping genes from non-duplicated regions. Array CGH would be painful if you have large numbers in your populations, expensive, and you really only want to know about this one region anyway. If there is a BAC within your area of duplication (look in UCSC) you could order probes for that and prove the duplication with FISH -- but that is more work than qPCR.
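For what it's worth, turning qPCR Ct values into a copy number estimate is usually done with the comparative Ct (2^-ΔΔCt) calculation. Below is a minimal sketch of that arithmetic, not a validated analysis pipeline. It assumes roughly 100% amplification efficiency for every primer pair, a calibrator sample known to carry the normal two copies, and reference Cts taken from housekeeping genes outside the duplication. The function name and arguments are made up for illustration.

```python
from statistics import mean


def relative_copy_number(target_ct, reference_cts,
                         calib_target_ct, calib_reference_cts,
                         baseline_copies=2):
    """Estimate copy number with the comparative Ct (2^-ddCt) method.

    Assumes ~100% amplification efficiency (one cycle = one doubling)
    and that the calibrator sample carries `baseline_copies` copies.
    """
    # Normalise the target against the mean of the reference genes.
    delta_sample = target_ct - mean(reference_cts)
    delta_calib = calib_target_ct - mean(calib_reference_cts)
    # One cycle earlier than the calibrator means twice the template.
    ddct = delta_sample - delta_calib
    return baseline_copies * 2 ** (-ddct)


# Hypothetical Ct values: the target amplifies one cycle earlier than in
# the calibrator, so the estimate is twice the baseline copy number.
estimate = relative_copy_number(24.0, [25.0, 25.0, 25.0],
                                25.0, [25.0, 25.0, 25.0])
print(estimate)
```

In practice you would run technical replicates and check primer efficiencies with a dilution series before trusting the numbers.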

8.3 years ago by
Berlin, Germany
Vikas Bansal2.4k wrote:

If you really want to confirm it, then I would suggest using wet lab methods, for example PCR, MLPA, CGH or FISH.

8.3 years ago by
liz.batty30 wrote:

I think you can use your existing sequence data to help narrow down the breakpoints. There's a whole bunch of software and approaches for detecting structural variation - try for a start. You can use the paired-end reads to help - look for pairs where one of the reads is mapped in the 500kb region and the other is mapped either to another chromosome or, more likely (if it's a tandem duplication), to another location on the same chromosome, with the two reads mapping further apart than you would expect from the library insert size. You can also use the single-end reads which have split mapping as you suggest above - try , there's probably other software.

That should get the breakpoint region down to a range where you can PCR and sequence it with a single set of primers, at least in one patient, and the bioinformatic methods should get you close enough to know whether you're going to have closely clustered breakpoints in all your population or a wide range.
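The paired-end filter described above can be sketched in a few lines. This is only an illustration of the logic, assuming you have already pulled chromosome and position for each read and its mate out of the alignments (for example from the RNAME, POS, RNEXT and PNEXT columns of a SAM file). The `Pair` record, the region tuple and the `max_insert` cutoff are hypothetical names, and a real SV caller does considerably more (orientation checks, mapping quality filters, clustering).

```python
from dataclasses import dataclass


@dataclass
class Pair:
    name: str        # read name
    chrom: str       # chromosome of this read
    pos: int         # mapped start of this read
    mate_chrom: str  # chromosome of the mate
    mate_pos: int    # mapped start of the mate


def discordant_pairs(pairs, region, max_insert):
    """Return (name, reason) for pairs anchored in `region` whose mate
    maps to another chromosome or further away than `max_insert`."""
    chrom, start, end = region
    hits = []
    for p in pairs:
        # Only consider reads anchored inside the candidate region.
        if p.chrom != chrom or not (start <= p.pos <= end):
            continue
        if p.mate_chrom != p.chrom:
            hits.append((p.name, "other_chromosome"))
        elif abs(p.mate_pos - p.pos) > max_insert:
            hits.append((p.name, "too_far_apart"))
    return hits


# Made-up example: a 500 kb region and a library where pairs are
# expected to map within 1 kb of each other.
example = [
    Pair("r1", "chr2", 1_200_000, "chr2", 1_200_350),  # normal pair
    Pair("r2", "chr2", 1_499_900, "chr2", 1_000_100),  # spans ~500 kb
    Pair("r3", "chr2", 1_300_000, "chr7", 42_000),     # mate elsewhere
]
found = discordant_pairs(example, ("chr2", 1_000_000, 1_500_000), 1_000)
print(found)
```

A sensible `max_insert` would come from the insert size distribution of your own library (say the mean plus a few standard deviations), not from a fixed number like the one used here.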

8.3 years ago by
Rubal7770 wrote:

Thanks Alex and Vikas, these are both useful ideas. I should have specified in my original question that I am also keen to know exactly where the copy number variation occurs, at nucleotide resolution, so that we can see if the variation is likely to be disrupting gene expression, for example by occurring in the middle of a gene. I can see how this could also be done with a large set of primers, but I believe in this case a simpler approach would be a bioinformatic one, using the extensive sequence data we have. But perhaps I am wrong.


You could edit your question or use comments. If you want to know the place of the duplication (in your case), i.e. where the duplicated region gets inserted in the genome after duplication, then a purely bioinformatic approach is not a good solution. This is because the duplication you found in that 500kb region (assuming a read-depth approach) tells you which region got duplicated, not where the copy was inserted.
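To make the read-depth point concrete, here is a rough sketch of what that approach gives you: windows where one population has elevated normalised coverage relative to the other, and nothing about where the extra copy sits. It assumes you already have mean depth per fixed-size window for both populations. The function name and the 0.58 threshold (about log2 of 1.5, i.e. three copies versus two) are illustrative assumptions, and reporting the first and last flagged window only makes sense if there is a single contiguous region.

```python
import math
from statistics import median


def duplicated_span(depth_a, depth_b, window_size, min_log2=0.58):
    """Return (start, end) covering the windows where population A has
    elevated depth relative to population B, or None if there are none.

    Depths are divided by each population's median window depth so that
    overall differences in sequencing yield cancel out.
    """
    med_a, med_b = median(depth_a), median(depth_b)
    flagged = []
    for i, (a, b) in enumerate(zip(depth_a, depth_b)):
        if a <= 0 or b <= 0:
            continue  # skip windows with no usable coverage
        log_ratio = math.log2((a / med_a) / (b / med_b))
        if log_ratio >= min_log2:
            flagged.append(i)
    if not flagged:
        return None
    # Approximate boundaries only, at window resolution.
    return flagged[0] * window_size, (flagged[-1] + 1) * window_size


# Toy data: six 1 kb windows, the middle two at double depth in A.
span = duplicated_span([50, 50, 100, 100, 50, 50], [50] * 6, 1000)
print(span)
```

The boundaries are only as precise as the window size, which is why split reads or PCR are still needed for nucleotide resolution.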

modified 8.3 years ago • written 8.3 years ago by Vikas Bansal2.4k

True, I currently don't know where this region sits in the genome (presumably the most likely answer for a CNV is that it's a tandem duplication). But I was thinking that if we find reads in which a certain portion of the bases align to the original region and the rest map to an unexpected location, these would tell us where the CNV occurs. I believe these reads are known as 'junction fragments'. But not knowing exactly where the CNV ends leaves a very large search space of possible unmapped reads to check through.
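One way the search space might be narrowed, sketched here under assumptions: if the reads were aligned with an aligner that soft-clips the non-matching end (rather than leaving the read unmapped), junction reads show up as soft-clipped alignments, and the clip coordinates pile up at the breakpoint. The snippet only needs the POS and CIGAR fields of each alignment near the coverage boundary. The function names and the `min_support` cutoff are hypothetical, and hard clips next to soft clips are ignored for simplicity.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clip_positions(pos, cigar):
    """Reference coordinates at which an alignment is soft-clipped.

    `pos` is the 1-based leftmost aligned position (SAM POS).
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    sites = []
    if ops and ops[0][1] == "S":
        sites.append(pos)  # clipped on the left of the first aligned base
    if len(ops) > 1 and ops[-1][1] == "S":
        # Operations M, D, N, = and X consume reference bases.
        ref_len = sum(n for n, op in ops if op in "MDN=X")
        sites.append(pos + ref_len)  # first base past the aligned part
    return sites


def candidate_breakpoints(alignments, min_support=3):
    """Positions where at least `min_support` alignments are clipped."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(clip_positions(pos, cigar))
    return sorted(p for p, n in counts.items() if n >= min_support)


# Made-up 125 bp reads: four of them are clipped at position 1060.
reads = [(1000, "60M65S"), (1010, "50M75S"), (1020, "40M85S"),
         (1060, "30S95M"), (900, "125M")]
print(candidate_breakpoints(reads))
```

The clipped portions of the supporting reads could then be realigned on their own to see where the other side of the junction maps.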

written 8.3 years ago by Rubal7770

I don't think I would rely on the approach you suggest to tell you anything with certainty. If you want to verify the duplication exists -- use qPCR. If you want to visualize where the duplication is inserted you need to use FISH. I would not assume the duplication is in tandem at all without any evidence to that effect. The dup could be on a supernumerary marker chromosome for all you know. If you want to see if gene expression is affected by the duplication you need to do some mRNA work. Otherwise, you are left with predictions that are not biologically validated.

modified 8.3 years ago • written 8.3 years ago by Alex Paciorkowski3.4k

I agree with Alex. Just curious, how long are your reads?

written 8.3 years ago by Vikas Bansal2.4k

125bp single end. We have some paired end reads that could also be very useful in this regard.

modified 8.3 years ago • written 8.3 years ago by Rubal7770