Hello everybody,
I am trying to create a dataset of simulated exome reads (with simulated base qualities as well). I am currently using simNGS from EBI to create a fragment library and generate reads. The main issue is that with a workflow like:
fasta genome reference -> fragment library -> simulated reads.
We end up with simulated genomic reads instead of exome reads. A workflow such as:
fasta exome reference -> fragment library -> simulated reads
doesn't sound so good either, because it leaves out lots of exome-specific behaviours (e.g., off-target reads).
Is there any other approach to follow in order to obtain a reliable whole-exome simulated read dataset?
For what reason are you doing your simulation? Could you add to your question details about what behaviors you want to capture and why?
I'm interested in how variant calling behaves on whole-exome sequencing data, specifically how base quality distortion affects variant calls (at the SNP level). I'm focusing on Illumina platforms.
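For concreteness, here is a minimal Python sketch of one simple kind of base quality distortion: shifting every Phred score in a FASTQ quality string by a constant. It is only an illustration, not the pipeline discussed in this thread. The function names, the clipping bounds `q_min` and `q_max`, and the uniform shift are all assumptions; the only fixed convention used is the Phred+33 encoding of current Illumina FASTQ files.

```python
def distort_qualities(qual, shift, offset=33, q_min=2, q_max=41):
    """Shift every Phred score in a FASTQ quality string by `shift`.

    qual   : quality string, assumed to be Phred+33 encoded
    shift  : integer added to each score (negative = pessimistic qualities)
    q_min, q_max : hypothetical clipping bounds for the distorted scores
    """
    out = []
    for ch in qual:
        q = ord(ch) - offset                    # decode character to Phred score
        q = min(max(q + shift, q_min), q_max)   # apply shift, then clip
        out.append(chr(q + offset))             # re-encode as Phred+33
    return "".join(out)


def phred_to_error_prob(q):
    """Convert a Phred score to the error probability it claims."""
    return 10 ** (-q / 10)
```

Running the same simulated reads through the variant calling pipeline with several values of `shift` would give a crude picture of how sensitive SNP calls are to miscalibrated qualities. A position-dependent or base-dependent distortion would probably be closer to real Illumina behaviour, but that is beyond this sketch.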
For the purposes of variant calling behavior, do you need to model things like off-target reads and variable coverage since you are interested in base quality distortion? I know you want to, but do you need to?
The main point is that we use a complete whole-exome variant calling pipeline (including the mapping step). We'd like a more realistic simulated read dataset to avoid additional bias, but if there's no way to get one we'll stick with what we have.
If that is what you are interested in, you would probably want to simulate your (hybrid) selection process. I have no idea how effectively the hybridization characteristics of a genomic library against a pool of several hundred thousand oligos can be captured, from either a computational or a biochemical point of view. The value of doing so is also unknown, at least to me.
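A very crude stand-in for simulating the selection step, sketched in Python under strong assumptions: a fixed fraction of fragments is drawn uniformly from the whole reference (off-target), and the rest are forced to overlap a target interval chosen in proportion to its width. This does not model hybridization chemistry at all. The function `sample_fragments`, its parameters (`off_target_frac`, `mean_len`, `sd_len`) and their default values are hypothetical and would need to be tuned against real capture data.

```python
import random


def sample_fragments(genome_len, targets, n_fragments, off_target_frac=0.2,
                     mean_len=200, sd_len=30, seed=0):
    """Return a list of (start, end) half-open fragment coordinates.

    targets         : list of (start, end) half-open target intervals
                      (e.g. exon/bait coordinates on one sequence)
    off_target_frac : assumed fraction of fragments drawn uniformly
                      from the whole sequence
    """
    rng = random.Random(seed)
    widths = [end - start for start, end in targets]
    fragments = []
    for _ in range(n_fragments):
        # Fragment length: Gaussian, floored at 50 bp, capped at the sequence length.
        length = min(max(50, int(rng.gauss(mean_len, sd_len))), genome_len)
        if rng.random() < off_target_frac:
            # Off-target: uniform start (may still hit a target by chance).
            start = rng.randint(0, genome_len - length)
        else:
            # On-target: pick a target weighted by width, pick a base in it,
            # then place the fragment so that it covers that base.
            t_start, t_end = rng.choices(targets, weights=widths, k=1)[0]
            anchor = rng.randint(t_start, t_end - 1)
            start = anchor - rng.randint(0, length - 1)
            start = min(max(start, 0), genome_len - length)
        fragments.append((start, start + length))
    return fragments
```

The resulting coordinates could be used to cut fragment sequences out of the genome FASTA before read simulation, so that the reads include both variable on-target coverage and some off-target background. Whether such a simple model adds realism worth having is exactly the open question raised above.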