Question: SNP calling using a reference with no raw data
4.2 years ago by
Yahan370 wrote:

I have to do SNP calling on 40 or so samples, a few of which originate from public sources. For these, the raw data is not always available.

Therefore I thought of building a dummy paired-end FASTQ dataset by chopping that sample's reference sequence into pieces, using a sliding-window approach to add some coverage.

Any thoughts on this?

I would remove all monomorphic calls for this sample and apply default filters, such as removing SNPs in repeat regions and SNPs near indels.

An alternative would be to compare the reference on which mapping will be done with this reference using MUMmer. But then I would have to integrate the calls into the VCF, and SNP-calling metrics would be absent for this sample.

I don't like either option very much, but I don't really see an alternative.

Thanks for any suggestion.


What's the goal of your analysis? Do you want dummy SNPs for the samples where you don't have raw data? Your question and approach are not clear.

modified 4.2 years ago • written 4.2 years ago by geek_y9.1k

Well, we want to know the genotype call for the raw-less sample(s), provided that the calls from the other samples are confident enough to accept the SNP. We will then filter downstream for assay design, taking the calls for all samples into account. So I would probably not use the raw-less calls for quality filtering.

The filtering will be done on the VCF, so I need the calls in there.

Removing the monomorphs will discard some true positives, but we're not really interested in those, so that's not a big problem, except that if they fall in the flanking sequence they could hamper the assay.

I guess I have to weigh which is more work: creating the dummy FASTQ, or adding the calls made by direct reference comparison to the VCF.

written 4.2 years ago by Yahan370
4.2 years ago by
geek_y9.1k wrote:

If you would like to create raw FASTQ files, you need to replicate the wet-lab protocol of that platform.

For example, for Illumina HiSeq, take the FASTA file of the genome and randomly fragment it into multiple chunks. Then size-select the fragments (e.g. keep those of 300-500 bp) and read 100 bp from both ends. You may ignore the error rates and quality information for now. A window-based approach is not correct, as genomic DNA is randomly fragmented. Or simply use one of the existing NGS data simulator programs, which take care of all the parameters.
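A minimal Python sketch of the random-fragmentation idea described above, in case it helps. Everything in it is an assumption made for illustration: the function names (`revcomp`, `simulate_pairs`, `fastq_record`), the read names, the uniform fragment-size distribution, and the constant quality character are all hypothetical choices, and no sequencing errors are modelled. A dedicated read simulator would handle these details properly.

```python
import random


def revcomp(seq):
    """Reverse-complement a DNA string; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def simulate_pairs(genome, n_pairs, read_len=100,
                   min_frag=300, max_frag=500, seed=0):
    """Yield (name, read1, read2) from randomly placed fragments.

    Fragment length is drawn uniformly from [min_frag, max_frag]
    (an assumed stand-in for size selection), the start position is
    uniform along the sequence, and the strand is chosen at random.
    """
    if len(genome) < max_frag:
        raise ValueError("sequence is shorter than the largest fragment")
    rng = random.Random(seed)
    for i in range(n_pairs):
        frag_len = rng.randint(min_frag, max_frag)
        start = rng.randint(0, len(genome) - frag_len)
        frag = genome[start:start + frag_len].upper()
        if rng.random() < 0.5:
            frag = revcomp(frag)  # fragment sequenced from the other strand
        read1 = frag[:read_len]            # first read_len bases
        read2 = revcomp(frag[-read_len:])  # opposite end, opposite strand
        yield f"sim_{i}", read1, read2


def fastq_record(name, seq, qual_char="I"):
    """Format one FASTQ record with a constant, made-up quality string."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"
```

To reach a target depth, the number of pairs is roughly `coverage * len(genome) / (2 * read_len)`. Each pair would be written as two records, one per file (R1 and R2), with the same read name. The constant `I` corresponds to Phred 40 in Phred+33 encoding, so any quality-based metrics for this sample will be artificial.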


