Simulate Ngs Mapped Reads
11.1 years ago
FGV ▴ 170

Dear all,

I'm trying to simulate some Illumina reads to test some analyses, but I don't want to deal with the mapping uncertainty. The idea is:

  • simulate N haplotypes from an ancestral seq
  • assume diploid indiv
  • simulate mapped NGS reads from each indiv (SAM output) without indels
  • "force" the ancestral seq as the reference for all individuals; everything should map correctly since there are no indels.
  • call genotypes
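The first three steps above could be sketched roughly as follows. This is only a minimal illustration, assuming a uniform per-site substitution rate, no indels, and error-free forward-strand reads; the function names (`mutate`, `simulate_individuals`, `sample_reads`) and all parameters are hypothetical and not taken from ART or any other simulator.

```python
import random

BASES = "ACGT"

def mutate(seq, rate, rng):
    """Copy seq, substituting each site with probability `rate` (no indels)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            # Pick one of the three other bases so the site really changes.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate_individuals(ancestral, n_indiv, rate, seed=0):
    """Draw 2 * n_indiv haplotypes from the ancestral sequence and pair them up."""
    rng = random.Random(seed)
    haplotypes = [mutate(ancestral, rate, rng) for _ in range(2 * n_indiv)]
    return [(haplotypes[2 * i], haplotypes[2 * i + 1]) for i in range(n_indiv)]

def sample_reads(individual, n_reads, read_len, rng):
    """Sample (0-based start, sequence) pairs from a diploid individual."""
    reads = []
    for _ in range(n_reads):
        hap = rng.choice(individual)  # either haplotype, with equal probability
        start = rng.randrange(len(hap) - read_len + 1)
        reads.append((start, hap[start:start + read_len]))
    return reads
```

Because there are no indels, every haplotype has the same coordinates as the ancestral sequence, so the sampled start position is also the position on the reference.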

So, I was looking for an NGS read simulator that would output SAM files directly. Actually, I don't know why SAM is not the default output for these programs, since converting from FASTQ to SAM is pretty straightforward.

I looked around and could only find ART, but I wanted to try a couple more.

thanks, FGV

ngs simulation reads • 3.8k views
11.1 years ago

I think you are misunderstanding what ART and other such read simulation tools actually do. These generate short sequences (reads) based on some parameters. But sequencing reads stored in a FASTQ file cannot just be "converted" to a SAM alignment.

One would need to align these reads to a reference and the result of that process will be an alignment that may be in the SAM format. This is the process that you should be following as well.


As far as I understand, these programs sample short reads from an input sequence under some parameters (coverage, error rate, etc.). Also, the diff between FASTQ and SAM is the mapping information, and since the original sequence is known, as well as the original position from where the read was sampled, I think it should be straightforward to get the "true" SAM file, no?
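A minimal sketch of that idea, assuming unpaired, forward-strand, indel-free reads whose sampling position is known: each read becomes one SAM line with a CIGAR of all matches. The helper names (`sam_header`, `sam_record`) are hypothetical; the default MAPQ of 255 is the SAM value for "mapping quality not available", and the placeholder quality string is an assumption.

```python
def sam_header(ref_name, ref_len):
    """Minimal SAM header for a single reference sequence."""
    return f"@HD\tVN:1.6\tSO:unsorted\n@SQ\tSN:{ref_name}\tLN:{ref_len}"

def sam_record(read_name, ref_name, start0, seq, qual=None, mapq=255):
    """Build one SAM alignment line from a known 0-based start position."""
    if qual is None:
        qual = "I" * len(seq)  # placeholder base qualities (assumption)
    fields = [
        read_name,        # QNAME
        "0",              # FLAG: unpaired, forward strand
        ref_name,         # RNAME
        str(start0 + 1),  # POS is 1-based in SAM
        str(mapq),        # MAPQ
        f"{len(seq)}M",   # CIGAR: no indels, so every base is M
        "*",              # RNEXT: no mate
        "0",              # PNEXT
        "0",              # TLEN
        seq,              # SEQ
        qual,             # QUAL
    ]
    return "\t".join(fields)
```

For example, `sam_record("r1", "ancestral", 4, "ACGTA")` places the read at position 5 with CIGAR `5M`.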

The thing is that I'm trying to avoid dealing with all parameters/programs/uncertainty associated with read mapping.


The oversimplification in your assumption is that each read will match uniquely to a given location, whereas a real genome always differs from the so-called reference genome.

Thus a realistic simulated read that is generated from a reference may be quite different from the reference location - and finding the exact location of where it best aligns may not be known without an alignment step. This seems counter-intuitive since we supposedly do know which location it actually originates from.

But one cannot claim a priori that this location is the best alignment, since the read could actually align somewhere else with higher quality. So any type of generated alignment from a simulated read would be of pretty bad quality.
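A toy illustration of this point, assuming a made-up reference with two near-identical repeat copies and a single sequencing error; the sequences and helper names are hypothetical, and mismatch counting stands in for a real aligner's scoring.

```python
def mismatches(read, ref, pos):
    """Count mismatches between read and the reference window starting at pos."""
    window = ref[pos:pos + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)

def best_positions(read, ref):
    """Return the lowest mismatch count and every position that achieves it."""
    scores = [mismatches(read, ref, p) for p in range(len(ref) - len(read) + 1)]
    best = min(scores)
    return best, [p for p, s in enumerate(scores) if s == best]

copy_a = "ACGTTGCA"
copy_b = "ACGTTGCC"                   # differs from copy_a at the last base
ref = copy_a + "GGGGGGGG" + copy_b    # two near-identical repeats

true_pos = 0
read = "ACGTTGCC"                     # sampled from copy_a, last base misread A -> C

best, positions = best_positions(read, ref)
print(mismatches(read, ref, true_pos), best, positions)
```

Here the read has one mismatch at its true origin (position 0) but matches the second copy (position 16) exactly, so an aligner would place it there rather than at the true location.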


Could you elaborate a bit more, please? I've edited the question to add a bit more detail.

I understand when you say that when we simulate a read from a genome the true location might not be the best match (seq errors or diff in reference). But why does that matter? If anything it should be more accurate, since it is the true location after all, no?

However, I can tell you that when I call genotypes from these data I do find extremely high error rates, but I really don't see why.

thank you


Because some of your alignments would be fake alignments that no aligner would be able to produce, so they are of little use for anything, especially from the point of view of a simulator, where the goal is to create and operate with realistic data that is affected by the complex structures present in the genome.

This is akin to "predicting" a mutation based on already knowing what that mutation actually is.

The right solution is to use a simulation tool and then run an aligner; that way you get alignments that are more similar to what you would get from an experimental process. Look here as well: https://github.com/lh3/wgsim
