Tools to simulate Illumina short read sequences and ONT long reads with a reference genome
1 day ago
PolenP • 0

Hi, could you recommend tools for simulating whole-genome sequencing reads from a reference genome that also output a list of the introduced variants, as wgsim does?

I was able to run wgsim, but when I tried aligning the paired reads, they did not align together.

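reads=100000000

# $wgsim (path to the wgsim binary) and $ref (reference FASTA) must be set beforehand
for i in $(seq 1 10); do
  base="sim_${i}"
  seed=$((100 + i))   # different seed for each run (arbitrary choice)
  echo "Running $base  (seed=$seed)..."
  # options go before the positional arguments so this also works with non-GNU getopt
  "$wgsim" -1 70 -2 70 -N "$reads" -S "$seed" -e 0.0001 "$ref" "${base}.R1.fq" "${base}.R2.fq" > "${base}.out.log" 2>&1
  echo "$base finished (log: ${base}.out.log)"
done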

Or maybe I am using wgsim incorrectly? I hope you can help me. Thank you!
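Since the question asks for the list of variants: wgsim prints the simulated mutations to stdout, so redirecting stdout and stderr into the same log mixes the variant table with the program's messages. A minimal sketch of keeping them apart follows. The function name, the file names, and the reference path are placeholders, not part of the original post, and the read count is passed in as an argument.

```shell
#!/usr/bin/env bash
# Sketch: keep wgsim's variant list (stdout) separate from its messages (stderr).
# simulate_with_variants, the output names, and reference.fa are placeholders.
simulate_with_variants() {
  local ref=$1 base=$2 seed=$3 pairs=$4
  wgsim -1 70 -2 70 -N "$pairs" -S "$seed" -e 0.0001 \
    "$ref" "${base}.R1.fq" "${base}.R2.fq" \
    > "${base}.variants.txt" 2> "${base}.log"
}

# Only run when wgsim and the reference are actually available.
ref="reference.fa"   # placeholder path
if command -v wgsim >/dev/null 2>&1 && [ -f "$ref" ]; then
  simulate_with_variants "$ref" sim_1 101 1000000
fi
```

According to the wgsim README, each line of that table gives the chromosome, position, original base, new base, and haplotype; it is worth checking this against the version you have installed.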

simulate reads bioinformatics short short-read • 2.5k views

It's a little unclear what you mean by "I was able to use wgsim, but when I tried aligning the paired-reads, it's not aligning together."


Sorry about that. When I align the paired reads together (left and right), they should align with common sequence at some ends, making one longer consensus sequence. I was able to align pairs this way with actual short-read pairs.


I believe the pairs will only share sequence (i.e., "overlap") if the insert size is small. wgsim has a flag, -d ("outer distance between the two ends", default: 500), which I think controls the insert size. Lowering it might produce some overlap, but I haven't tested this myself.
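To make the arithmetic concrete: two mates overlap only when their combined length exceeds the outer distance. The helper below is a rough sketch with a hypothetical function name. It ignores the spread of insert sizes that wgsim draws around -d and assumes the outer distance is at least as long as each read, so it only gives the expected overlap at the mean.

```shell
#!/usr/bin/env bash
# Expected overlap (bp) between mates = len1 + len2 - outer distance, floored at 0.
# expected_overlap is a hypothetical helper; it ignores insert-size variation.
expected_overlap() {
  local len1=$1 len2=$2 outer=$3
  local ov=$(( len1 + len2 - outer ))
  if (( ov < 0 )); then
    ov=0
  fi
  echo "$ov"
}

expected_overlap 70 70 500   # default -d with 70 bp reads: prints 0
expected_overlap 70 70 100   # shorter outer distance: prints 40
```

With the 70 bp reads from the question, -d would have to be below 140 before the mates start to overlap on average.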


"they should align with common sequence at some ends, making one longer consensus sequence."

You don't want to simulate reads like this. Good WGS libraries should not have reads that overlap in the middle, because overlapping mates indicate short inserts, which is not what one wants in real life.

To cover a particular mutation introduced by the simulation, you want more read pairs covering it rather than a single overlapping read pair covering it.
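How many read pairs cover a given site on average follows from the usual depth estimate: read pairs times combined read length, divided by genome size. A small sketch is below. The function name is hypothetical, and the genome size in the example (3.1 Gb) is an assumed value, since the thread does not say which reference is used.

```shell
#!/usr/bin/env bash
# Mean depth = read pairs * (len1 + len2) / genome size.
# mean_coverage is a hypothetical helper; the 3.1 Gb genome size is an assumption.
mean_coverage() {
  local pairs=$1 len1=$2 len2=$3 genome_size=$4
  awk -v n="$pairs" -v a="$len1" -v b="$len2" -v g="$genome_size" \
    'BEGIN { printf "%.2f\n", n * (a + b) / g }'
}

mean_coverage 100000000 70 70 3100000000   # prints 4.52
```

Raising -N (or the read length) increases this depth directly, which is the lever suggested above for covering simulated mutations.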

1 day ago

The best one I've used for ONT reads so far is badread - https://github.com/rrwick/Badread

For Illumina I've used and liked insilicoseq - https://insilicoseq.readthedocs.io/en/latest/
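For orientation, basic invocations of the two tools look roughly like the sketch below. The wrapper function names and all file names are placeholders, the depth and error model are arbitrary choices, and the flags are the ones I recall from each project's documentation, so please confirm them with each tool's --help.

```shell
#!/usr/bin/env bash
# Sketch only: wrapper names and file names are placeholders.
# Flags are as recalled from the Badread and InSilicoSeq docs; verify with --help.
simulate_ont() {       # Badread writes simulated long reads (FASTQ) to stdout
  local ref=$1 out=$2
  badread simulate --reference "$ref" --quantity 50x > "$out"
}

simulate_illumina() {  # InSilicoSeq writes paired reads using a built-in error model
  local ref=$1 prefix=$2
  iss generate --genomes "$ref" --model miseq --output "$prefix"
}

# Only run when the tools and the reference are actually available.
ref="reference.fa"     # placeholder path
if [ -f "$ref" ]; then
  if command -v badread >/dev/null 2>&1; then simulate_ont "$ref" ont_reads.fastq; fi
  if command -v iss >/dev/null 2>&1; then simulate_illumina "$ref" illumina_reads; fi
fi
```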


"which will also give me list of the variants"

Do either of these packages satisfy the requirement of generating known mutations?


Do you know if ART is able to generate known mutations?


I think only the fraction of sequencing errors can be specified. If a mix of the reference and a mutated genome is used as input, setting sequencing errors to 0 might give reads with fixed mutations.

1 day ago
GenoMax 154k

"if you can recommend me tools"

You can use randomreads from the BBMap suite to generate short Illumina reads with known mutations. A guide is available here: https://bbmap.org/tools/randomreads
