Hi, I would like to ask if you can recommend me tools that I can use to simulate whole genome sequences using a reference genome which will also give me list of the variants just like wgsim?
I was able to use wgsim, but when I tried aligning the paired-reads, it's not aligning together.
reads=100000000
for i in $(seq 1 10); do
base="sim_${i}"
seed=$((100 + i)) # different seed for each run (arbitrary choice)
echo "Running $base (seed=$seed)..."
"$wgsim" "$ref" "${base}.R1.fq" "${base}.R2.fq" -1 70 -2 70 -N $reads -S $seed -e 0.0001 > "${base}.out.log" 2>&1
echo "$base finished (log: ${base}.out.log)"
done
of maybe I am using wgsim wrong? I hope you can help me. Thank you!
It's a little unclear what you mean by "I was able to use wgsim, but when I tried aligning the paired-reads, it's not aligning together."
Sorry about that. It's when I align the paired reads together like the left and righ, they should align with common sequence at some ends, making one longer consensus sequence. I was able to align the pairs with an actual short read pairs.
I believe the pairs will only have common sequence (e.g. the pairs will "overlap") if the insert size is small. the wgsim program has a flag called -d ("outer distance between the two ends") which i think can adjust the insert size and might be able to be made smaller (default: 500) to make it produce some overlap but I haven't tested it myself
You don't want to simulate reads like this. Good WGS libraries should not have reads that overlap in the middle because they will represent short inserts. Not what one wants in real life.
You want more data to cover a particular mutation intriduced by simulation by having more read pairs covering it, rather than having anoverlapping read pair covering that mutation.