Question: Short Read Simulator For CNV Indel?
8.8 years ago by
Pascal1.5k wrote:


Is there a way to simulate short reads with CNV indel (1-50kb)?

I've read wgsim manual for instance but it looks to generate small indels only.


ADD COMMENTlink written 8.8 years ago by Pascal1.5k

Some questions for you:
- Copy number variation (CNV) or indel?
- Does it make sense genetically to have an insertion of 50 kb?
- How many cases of large indels have been described, and in which regions of the genome?
- Large indels are most likely not neutral; how many of these could exist per genome at the same time (possibly 1 at most)?

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

Further questions: Diploid or haploid genome/simulation?

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

So CNV indel is supposed to mean to simulate a gene duplication or deletion?

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k
8.8 years ago by
Cambridge, UK
Stefano Berri4.1k wrote:

I second Michael: produce an "altered" genome and run wgsim on that. However, to make it realistic, you will have to make two copies of each chromosome and introduce the CNV in only one of them (*). Do NOT insert random sequence; if you need to make an amplification/duplication, insert a sequence copied from somewhere else, so that you will be able to tell which regions have been duplicated.

When I did some simulations, I found that version 2.6 was better than 3.0, as 3.0 seemed to have some sort of "chromosome-specific" bias: I was getting uneven coverage...

(*) Be careful, though. wgsim will introduce mutations and small indels into each chromosome copy, so their frequency will be twice as high (because you have twice as many chromosomes), but each mutation will be either heterozygous or, apparently, present in 25% of your reads (as opposed to 100% or 50%).
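The two-copy idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a tested pipeline: the function names (`tandem_duplicate`, `make_diploid`, `write_fasta`) and the `_hapA`/`_hapB` record suffixes are hypothetical, the duplication is placed in tandem right after the source segment, and coordinates are 0-based with an exclusive end.

```python
import io

def tandem_duplicate(seq, start, end, copies=2):
    """Return seq with seq[start:end] present `copies` times in tandem.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    if not (0 <= start < end <= len(seq)) or copies < 1:
        raise ValueError("need 0 <= start < end <= len(seq) and copies >= 1")
    return seq[:end] + seq[start:end] * (copies - 1) + seq[end:]

def make_diploid(name, seq, start, end, copies=2):
    """Return two FASTA records: one unmodified copy, one carrying the duplication."""
    return [
        (f"{name}_hapA", seq),
        (f"{name}_hapB", tandem_duplicate(seq, start, end, copies)),
    ]

def write_fasta(records, handle, width=60):
    """Write (name, sequence) records to an open text handle, wrapping lines."""
    for name, seq in records:
        handle.write(f">{name}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")

# Toy example: duplicate the CCCC block on one copy only.
records = make_diploid("chr20", "AAAACCCCGGGGTTTT", 4, 8)
out = io.StringIO()
write_fasta(records, out)
print(out.getvalue(), end="")
```

Because the inserted bases are copied from the chromosome itself rather than generated at random, reads from the extra copy map back to the source region, which is what lets a caller see the copy-number change.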

ADD COMMENTlink written 8.8 years ago by Stefano Berri4.1k

Btw: to simulate realistic loci on the X/Y chromosomes as well, these should not be duplicated, only the autosomes. If variation on the sex chromosomes is required, they should be treated separately.

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

First, sorry to come back to you on this issue so late. If I understood you correctly, Stefano, in order to introduce a CNV in chromosome 20 (for instance) I should: 1) take the reference FASTA file for chr20, 2) copy it, 3) introduce my CNV in one of the two files, 4) process both FASTA files with wgsim to create reads corresponding to the diploid genome. Is that what you mean?

ADD REPLYlink written 8.7 years ago by Pascal1.5k

Yes. I would recommend keeping all the sequences together in a single FASTA file. That way the number of reads from each sequence will be proportional to its length. Otherwise you will get relatively fewer reads from the chr20+insert (unless you specify the right number of reads for each FASTA file).

Unless you are planning to find reads across the breakpoint, you can just add the FASTA of the amplified segment.

Hope this helps
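To make the "proportional to length" point concrete, here is a rough Python sketch of the arithmetic. The helper names are hypothetical. The wgsim options used (`-N` for the number of read pairs, `-1` and `-2` for the two read lengths) are taken from wgsim's usage text; the command is only built as a list here, not executed, and the file names are placeholders.

```python
import math

def read_pairs_for_coverage(total_bases, coverage, read_length):
    """Read pairs needed so that 2 * read_length * pairs / total_bases is about `coverage`."""
    return math.ceil(coverage * total_bases / (2 * read_length))

def expected_pairs_per_record(lengths, pairs):
    """Expected pairs per FASTA record if reads are drawn in proportion to record length."""
    total = sum(lengths.values())
    return {name: pairs * length / total for name, length in lengths.items()}

def wgsim_command(fasta, pairs, read_length, out1, out2):
    """Build (but do not run) a wgsim command line for paired-end reads."""
    return ["wgsim", "-N", str(pairs), "-1", str(read_length), "-2", str(read_length),
            fasta, out1, out2]

# Toy numbers: a 1 Mb copy plus a copy with a 50 kb insert, 30x over the combined length.
lengths = {"chr20_hapA": 1_000_000, "chr20_hapB": 1_050_000}
pairs = read_pairs_for_coverage(sum(lengths.values()), coverage=30, read_length=100)
print(pairs, expected_pairs_per_record(lengths, pairs))
print(" ".join(wgsim_command("diploid.fa", pairs, 100, "reads_1.fq", "reads_2.fq")))
```

With both copies in one file, the longer copy automatically receives its larger share of the pairs; with separate files you would have to compute and pass each share yourself.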

ADD REPLYlink written 8.7 years ago by Stefano Berri4.1k

Hi all.

I'm trying to test tools to generate such genomes with SVs. I'm already testing SCNVSim, but I'd like to try other tools. Any other options now in 2015?


ADD REPLYlink written 5.1 years ago by Leandro Lima960
8.8 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

A read simulator with this feature is not required (maybe one exists anyway, but who cares?). You can simply modify the input sequence, i.e. the reference genome. Draw N random chromosome locations (chromosome, position); not much more than N = 1 makes sense to me. Draw the desired indel length from e.g. a Poisson distribution. For a deletion, remove that many bases from your FASTA sequence at that point; for an insertion, insert random sequence there. Give this file as input to your read simulator. From your answers to my comments you can get the right parameters for a little script that will do it.

What do you want to do with it, btw? These variations will be very easy to detect by lack of coverage in the region anyway, given your coverage is high enough.
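A minimal sketch of such a script in Python, under a few assumptions that are not part of the answer above: the function name `apply_random_indel` is hypothetical, the event length is drawn uniformly from the 1-50 kb range in the question (swapped in for the Poisson draw because Python's standard library has no Poisson sampler), and the reference is a toy random string rather than a parsed FASTA file.

```python
import random

def apply_random_indel(seq, kind, length, rng):
    """Apply one deletion ("DEL") or random-sequence insertion ("INS") at a random position.

    Returns (new_seq, event); event["pos"] is 0-based in the ORIGINAL sequence,
    so the record of what was done stays in reference coordinates.
    """
    if length < 1:
        raise ValueError("length must be >= 1")
    if kind == "DEL":
        if length > len(seq):
            raise ValueError("deletion longer than sequence")
        pos = rng.randrange(len(seq) - length + 1)
        new_seq = seq[:pos] + seq[pos + length:]
    elif kind == "INS":
        pos = rng.randrange(len(seq) + 1)
        insert = "".join(rng.choice("ACGT") for _ in range(length))
        new_seq = seq[:pos] + insert + seq[pos:]
    else:
        raise ValueError("kind must be 'DEL' or 'INS'")
    return new_seq, {"kind": kind, "pos": pos, "length": length}

# Toy run: one deletion in a 200 kb random "chromosome".
rng = random.Random(42)  # fixed seed so the run is reproducible
reference = "".join(rng.choice("ACGT") for _ in range(200_000))
length = rng.randint(1_000, 50_000)  # uniform stand-in for the Poisson draw
mutated, event = apply_random_indel(reference, "DEL", length, rng)
print(event, len(reference), len(mutated))
```

The returned event is the part worth keeping: it is the ground truth against which any caller's output would later be checked.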

ADD COMMENTlink written 8.8 years ago by Michael Dondrup47k

No, your question is not pointless; it is just that it would be very easy to write a script that does this, at least if you need more than 1-2 inserts. For fewer, I guess opening a FASTA file and editing a position by hand (and recording it exactly) would still be faster for me than writing a small script. Maybe I'm going to write an example for this application in R, though, for fun.

So, do you need help with such a script?

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k

Thanks, Michael. I understand from your answer and comment that my question is pointless. However, I aim to compare several SV detection algorithms, from small to large indels, in terms of accuracy, speed, memory/CPU consumption, etc. I thought that simulating a dataset with different events (not only small ones) could be useful for that purpose. I can of course insert events manually as you described, but this is not very convenient and is prone to errors. Thanks for your answer; it helps me a lot to understand what I am doing :-)

ADD REPLYlink written 8.8 years ago by Pascal1.5k

No, please!! Don't spend more time on this: you have already helped me a lot, really! I would feel bad if you dedicated more of your time. So if I understand correctly, the idea is to edit the FASTA file of a reference genome, copy a portion of the sequence n times (this portion is shorter than the read length, I guess) and then generate reads with a read simulator. Right?

ADD REPLYlink written 8.8 years ago by Pascal1.5k

Yes, that's what one could do (and I personally would do that by hand only if the number n was <= 3; otherwise I would write a script). It doesn't sound like rocket science, and I agree it is not a very clean solution. Actually it's quick and dirty, and you need to record exactly where you put that sequence. For producing heterozygous loci, please refer to Stefano's answer.
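For the record-keeping part, one possible sketch in Python: keep the planned events as a list, apply them from the rightmost start to the leftmost so that earlier reference coordinates are not shifted by later edits, and dump the same list as BED-style lines. Everything here is an assumption made for illustration: the function names, the event dictionary keys, the `DEL`/`DUP` labels, and the requirement that events do not overlap.

```python
def apply_events(seq, events):
    """Apply non-overlapping DEL/DUP events given in reference coordinates.

    Events are applied from right to left, so each start/end still refers to
    the original sequence when its turn comes (0-based, end exclusive).
    """
    for ev in sorted(events, key=lambda e: e["start"], reverse=True):
        start, end = ev["start"], ev["end"]
        if ev["kind"] == "DEL":
            seq = seq[:start] + seq[end:]
        elif ev["kind"] == "DUP":
            # Tandem duplication: add extra copies right after the source segment.
            seq = seq[:end] + seq[start:end] * ev.get("extra_copies", 1) + seq[end:]
        else:
            raise ValueError("kind must be 'DEL' or 'DUP'")
    return seq

def events_to_bed(events):
    """Format events as BED-style lines: chrom, start, end, kind (tab-separated)."""
    ordered = sorted(events, key=lambda e: (e["chrom"], e["start"]))
    return "".join(
        "\t".join([e["chrom"], str(e["start"]), str(e["end"]), e["kind"]]) + "\n"
        for e in ordered
    )

# Toy example on a 16 bp "chromosome".
events = [
    {"chrom": "chr20", "start": 8, "end": 12, "kind": "DUP"},
    {"chrom": "chr20", "start": 0, "end": 4, "kind": "DEL"},
]
print(apply_events("AAAACCCCGGGGTTTT", events))
print(events_to_bed(events), end="")
```

Writing the truth file from the same list that drives the edits avoids the transcription errors that hand editing invites.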

ADD REPLYlink written 8.8 years ago by Michael Dondrup47k