Question: Validation of SNP and INDEL calling pipeline
kspata (Chicago) wrote 9 days ago:

Hi,

I am validating an in-house pipeline for calling SNPs and INDELs in small genomes. For this purpose I am using the GIAB NA12878 HiSeq 2500 300X coverage dataset: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/

I have downloaded all FASTQ files from this folder and merged the forward reads from the two lanes into a single FASTQ file, and did the same for the reverse reads.

  1. Do I need to download and merge reads from the other folders as well, for example Sample_U0b, Sample_U0c and so on, together with Sample_U0a? Will the files from Sample_U0a alone give 300X coverage? I cannot find any explanation of whether to merge files from all samples or to use just one.

The samples I usually work with are ssDNA viruses or dsDNA plasmids, 3,000 to 12,000 bp long.

  2. Can I map the reads from NA12878 to a specific gene (of the same length as above) and generate a subset of reads mapped to that region, which I can then use for SNP and INDEL analysis? Will this approach work for validation?

Previously I used the PhiX dataset from Illumina to validate the pipeline. The problem is that only its SNPs are validated, not INDELs.

  3. Is there any plasmid or viral dataset other than PhiX that I can use for validation? It should contain both SNPs and INDELs (10 bp or longer) at different variant frequencies.

Thanks in advance!!

Tags: snp, alignment, indel, genome
WouterDeCoster (Belgium) wrote 9 days ago:

> Can I map the reads from NA12878 to a specific gene (same length as above) and generate a subset of reads mapped to that region, which I can use for further analysis of SNP and INDELs? Will this approach work for validation?

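No, you should always align to the full genome. If you restrict the reference to a single gene or region, reads that actually originate elsewhere in the genome (for example from paralogous or repetitive sequence) have no better place to align and can be forced onto that region. These false positive alignments bias the results and, in turn, affect variant calling.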
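
A common way to get a region-specific subset without this bias is to align against the full reference first and only then keep the alignments that overlap the target. The sketch below shows that second step in plain Python on SAM-format text. The function names, the MAPQ cutoff of 20, and any coordinates you pass in are illustrative assumptions, not part of any tool mentioned in this thread.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)


def reads_in_region(sam_lines, chrom, start, end, min_mapq=20):
    """Yield SAM records overlapping chrom:start-end (1-based, inclusive).

    Header lines, unmapped reads, and alignments below min_mapq are skipped.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos = int(fields[1]), fields[2], int(fields[3])
        mapq, cigar = int(fields[4]), fields[5]
        if flag & 0x4 or rname != chrom or mapq < min_mapq or cigar == "*":
            continue
        aln_end = pos + reference_span(cigar) - 1
        if pos <= end and aln_end >= start:
            yield line
```

In practice, `samtools view` with a region argument on a sorted, indexed BAM does the same job; the point is only that the filtering happens after alignment to the full reference, not before.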


@wouterDeCoster,

Thank you for your input. However, we do not have the system capacity to map whole-genome data to the entire human reference genome. I mapped the data from the link above to chr21 only, which gives a very low average per-base coverage of 3X, too low for reliable downstream variant calling.
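
For reference, expected depth can be estimated before aligning as (number of reads × read length) / target length, and the observed mean depth can be computed from the three-column text output of `samtools depth` (chromosome, position, depth). A rough sketch follows; the function names are made up for illustration, and no numbers from the GIAB data are assumed.

```python
def expected_coverage(n_reads, read_length, target_length):
    """Rough estimate: total sequenced bases divided by target size."""
    return n_reads * read_length / target_length


def mean_depth(depth_lines, target_length=None):
    """Mean per-base depth from `samtools depth` text output.

    Each line is 'chrom<TAB>pos<TAB>depth'. By default samtools depth omits
    zero-coverage positions, so pass target_length to average over the whole
    region rather than only over the positions that were reported.
    """
    total = 0
    n_positions = 0
    for line in depth_lines:
        if not line.strip():
            continue
        total += int(line.split("\t")[2])
        n_positions += 1
    denom = target_length if target_length is not None else n_positions
    return total / denom if denom else 0.0
```

Running `expected_coverage` on the read count of a single sample folder is a quick way to check how much of the advertised depth that folder contributes on its own.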

Can I use a read simulator like dwgsim (https://github.com/nh13/DWGSIM) for SNP and INDEL pipeline validation?
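
The idea behind simulation-based validation is that the variants are planted by you, so the truth set is known exactly. The toy sketch below is not DWGSIM and does not simulate reads. It only spikes SNPs and 10 bp INDELs into a small reference and scores a call set against the resulting truth list. All names (`spike_variants`, `score_calls`) and the default parameters are hypothetical.

```python
import random


def spike_variants(ref, n_snps, n_indels, indel_len=10, seed=0):
    """Return (mutated_sequence, truth) for a toy haploid reference.

    truth is a sorted list of (pos, ref_allele, alt_allele) with 0-based
    positions on the original reference; indels are anchored on the
    preceding base, as in VCF.
    """
    rng = random.Random(seed)
    step = indel_len + 2  # spacing that keeps variants from overlapping
    slots = list(range(1, len(ref) - step, step))
    sites = rng.sample(slots, n_snps + n_indels)
    truth = []
    for i, pos in enumerate(sites):
        if i < n_snps:
            alt = rng.choice([b for b in "ACGT" if b != ref[pos]])
            truth.append((pos, ref[pos], alt))
        elif (i - n_snps) % 2 == 0:  # insertion after the anchor base
            ins = "".join(rng.choice("ACGT") for _ in range(indel_len))
            truth.append((pos, ref[pos], ref[pos] + ins))
        else:  # deletion of indel_len bases after the anchor base
            truth.append((pos, ref[pos:pos + indel_len + 1], ref[pos]))
    seq = ref
    # Apply edits right to left so earlier coordinates stay valid.
    for pos, ref_allele, alt_allele in sorted(truth, reverse=True):
        seq = seq[:pos] + alt_allele + seq[pos + len(ref_allele):]
    return seq, sorted(truth)


def score_calls(truth, calls):
    """Precision and recall of called variants, by exact (pos, ref, alt) match."""
    truth_set, call_set = set(truth), set(calls)
    tp = len(truth_set & call_set)
    precision = tp / len(call_set) if call_set else 0.0
    recall = tp / len(truth_set) if truth_set else 0.0
    return precision, recall
```

Reads would then be simulated from the mutated sequence, and mixing them with reads from the unmodified reference at chosen ratios would give variants at different frequencies. Exact matching is a simplification: real comparisons usually normalize (left-align) INDELs first, since the same INDEL can be written in more than one way.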

Reply written 5 days ago by kspata