Question

How To Simulate Whole-Exome Reads

11.1 years ago

mpallocc ▴ 10

Hello everybody,

I am trying to create a dataset of simulated exome reads (with simulated base qualities as well). I am currently using simNGS from EBI to build a fragment library and generate reads. The main issue is that with a workflow like:

fasta genome reference -> fragment library -> simulated reads.

we get simulated genomic reads instead of exome reads. A workflow such as:

fasta exome reference -> fragment library -> simulated reads

doesn't sound so good, because it leaves out lots of exome-specific behaviours (e.g., off-target reads).

Is there any other approach to follow in order to obtain a reliable whole-exome simulated read dataset?

simulation • 3.8k views

ADD COMMENT • link 11.1 years ago by mpallocc ▴ 10

For what reason are you doing your simulation? Could you add to your question details about what behaviors you want to capture and why?

ADD REPLY • link 11.1 years ago by Sean Davis 26k

I'm interested in the behaviour of whole-exome sequencing variant calling, specifically how base quality distortion affects variant calls (at the SNP level). I'm focusing on Illumina platforms.

ADD REPLY • link 11.1 years ago by mpallocc ▴ 10
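
To make "base quality distortion" concrete, here is a minimal Python sketch. It assumes that distortion means shifting every Phred score by a fixed offset, which is only one possible reading of the term, and that qualities are written in Phred+33 encoding. The function names and the clamping bounds `q_min` and `q_max` are hypothetical choices for illustration, not part of simNGS or any variant caller.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred score, Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def distort_qualities(quals, offset, q_min=2, q_max=41):
    """Shift each quality by `offset` and clamp to [q_min, q_max].

    The bounds are assumed values chosen for illustration; adjust them
    to match the platform being modelled.
    """
    return [min(q_max, max(q_min, q + offset)) for q in quals]


def to_fastq_quality_string(quals):
    """Encode integer Phred scores as a Phred+33 FASTQ quality string."""
    return "".join(chr(q + 33) for q in quals)


if __name__ == "__main__":
    original = [40, 38, 30, 20, 2]
    distorted = distort_qualities(original, offset=-5)
    print(to_fastq_quality_string(original))
    print(to_fastq_quality_string(distorted))
```

Applying the same reads with several offsets to the calling pipeline would give a first look at how sensitive the SNP calls are to over- or under-stated qualities.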

Since you are interested in base quality distortion, do you need to model things like off-target reads and variable coverage to study variant calling behavior? I know you want to, but do you need to?

ADD REPLY • link 11.1 years ago by Sean Davis 26k

The main point is that we use a complete whole-exome variant calling pipeline (including the mapping step). We'd like to have a more realistic simulated read dataset to avoid additional bias, but if there's no way, we'll stick to what we have.

ADD REPLY • link 11.1 years ago by mpallocc ▴ 10

In that case, you would probably want to simulate your (hybrid) selection process. I have no idea how effectively the hybridization characteristics of a genomic library against a pool of several hundred thousand oligos can be captured, from either a computational or a biochemical point of view. The value of doing so is also unknown, at least to me.

ADD REPLY • link 11.1 years ago by Sean Davis 26k
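
As a rough illustration of the middle ground discussed in this thread, the sketch below draws fragments from the whole genome but enriches for target regions, so that a chosen share of fragments overlaps an exon and the rest land anywhere (off-target). It is a crude stand-in for hybrid capture, not a model of hybridization chemistry. The function names, the single-contig setup, and the `on_target_fraction` and `frag_len` values are all assumptions made for the example.

```python
import random


def sample_fragment_starts(genome_len, targets, n_fragments,
                           frag_len=200, on_target_fraction=0.7, seed=0):
    """Return fragment start positions on one hypothetical contig.

    targets: list of (start, end) half-open intervals, e.g. exons.
    on_target_fraction: assumed share of fragments forced to overlap a target.
    """
    rng = random.Random(seed)
    weights = [end - start for start, end in targets]
    last_start = genome_len - frag_len  # last valid fragment start
    starts = []
    for _ in range(n_fragments):
        if rng.random() < on_target_fraction:
            # On-target: pick a target weighted by its length, then a start
            # position such that the fragment overlaps it by at least 1 bp.
            t_start, t_end = rng.choices(targets, weights=weights, k=1)[0]
            lo = max(0, t_start - frag_len + 1)
            hi = min(last_start, t_end - 1)
            starts.append(rng.randint(lo, hi))
        else:
            # Off-target: uniform over the contig (may still hit a target by chance).
            starts.append(rng.randint(0, last_start))
    return starts


def overlaps_target(start, frag_len, targets):
    """True if the fragment [start, start + frag_len) overlaps any target."""
    return any(start < t_end and start + frag_len > t_start
               for t_start, t_end in targets)


if __name__ == "__main__":
    exons = [(1000, 1200), (50000, 50300)]  # made-up coordinates
    starts = sample_fragment_starts(100_000, exons, n_fragments=2000)
    on_target = sum(overlaps_target(s, 200, exons) for s in starts)
    print(f"on-target fragments: {on_target / len(starts):.2%}")
```

The resulting coordinates could be used to cut fragment sequences out of the genome reference before handing them to a read simulator, which keeps off-target reads in the dataset without modelling the capture biochemistry itself.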