Question

Validation of SNP and INDEL calling pipeline

0

4.9 years ago

kspata ▴ 80

Hi,

I am validating an in-house pipeline for calling SNPs and INDELs in small genomes. For this purpose I am using the GIAB NA12878 HiSeq 2500 300X coverage dataset: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/

I have downloaded all FASTQ files from this folder and merged the forward reads from the two lanes into a single FASTQ file, then did the same for the reverse reads.
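The lane merge described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names and file names are hypothetical, and the files are assumed to be gzip-compressed FASTQ. Gzip files can be concatenated byte-for-byte and still form a valid gzip file. The one thing to watch is that the forward and reverse files are merged in the same lane order, so that mates stay in sync.

```python
import gzip
import shutil


def merge_fastq_gz(inputs, output):
    """Concatenate gzipped FASTQ files byte-for-byte into `output`.

    A sequence of gzip members is itself a valid gzip file, so no
    decompression is needed.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def count_reads(path):
    """Count FASTQ records, assuming the usual 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def merge_lanes(r1_files, r2_files, r1_out, r2_out):
    """Merge R1 and R2 lane files in the same sorted order, then check
    that both merged files hold the same number of reads."""
    merge_fastq_gz(sorted(r1_files), r1_out)
    merge_fastq_gz(sorted(r2_files), r2_out)
    n1, n2 = count_reads(r1_out), count_reads(r2_out)
    if n1 != n2:
        raise ValueError(f"R1 has {n1} reads but R2 has {n2}")
    return n1
```

Sorting both lists relies on the R1 and R2 file names differing only in the read tag, which is how Illumina lane files are usually named; with other naming schemes the order should be given explicitly.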

Do I need to download and merge reads from the other folders as well, for example Sample_U0b, Sample_U0c and so on, together with Sample_U0a? Will the files from Sample_U0a alone give 300X coverage? I cannot find any explanation of whether to merge the files from all samples or from just one.

The samples I usually deal with are ssDNA viruses or dsDNA plasmids, 3000 to 12000 bp long.

Can I map the reads from NA12878 to a specific gene (same length as above) and generate a subset of reads mapped to that region, which I can use for further analysis of SNP and INDELs? Will this approach work for validation?

Previously I used the PhiX dataset from Illumina to validate the pipeline. The problem is that only its SNPs are validated, not INDELs.

Is there any plasmid/viral dataset other than PhiX that I can use for validation? It should contain both SNPs and INDELs (10 bp or longer) at different variant frequencies.

Thanks in advance!!

genome alignment SNP INDEL • 1.4k views

updated 4.9 years ago by WouterDeCoster 47k • written 4.9 years ago by kspata ▴ 80

score 0 · Answer 1 · 2019-06-08

0

4.9 years ago

WouterDeCoster 47k

Can I map the reads from NA12878 to a specific gene (same length as above) and generate a subset of reads mapped to that region, which I can use for further analysis of SNP and INDELs? Will this approach work for validation?

No, you should always align to the full genome. Restricting the alignment to a single gene or region can bias your results: reads that originate elsewhere in the genome but resemble the target are forced to align to it, and these false-positive alignments in turn affect variant calling.

0

@wouterDeCoster,

Thank you for your input. However, we do not have the system capacity to map whole-genome data to the entire human reference genome. I mapped the data from the link above to chr21, which gave a very low average per-base coverage of 3X, which is not enough for downstream variant calling.
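For reference, mean coverage can be estimated as the number of mapped reads times the read length, divided by the length of the target. A minimal sketch of that arithmetic follows; the function names are illustrative, and the 150 bp read length in the usage note is an assumption, not a property of the dataset above.

```python
import math


def mean_coverage(n_reads, read_length, target_length):
    """Expected mean per-base coverage: (reads * read length) / target length."""
    return n_reads * read_length / target_length


def reads_needed(target_coverage, read_length, target_length):
    """Smallest number of reads that reaches `target_coverage` on average."""
    return math.ceil(target_coverage * target_length / read_length)
```

For example, under these assumptions `reads_needed(100, 150, 12000)` gives the number of 150 bp reads required for 100X on a 12000 bp plasmid, and `mean_coverage` can be used to check what a given subset of reads would deliver before running the variant caller.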

Can I use a read simulator like DWGSIM (https://github.com/nh13/DWGSIM) for SNP and INDEL pipeline validation?
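The idea behind simulation-based validation is to plant variants at known positions, generate reads from the altered sequence, and then check whether the pipeline recovers them. The toy sketch below illustrates that idea in plain Python. It is not DWGSIM and does not reproduce its options or error model: reads are error-free, variants are assumed not to overlap, and all names are hypothetical.

```python
import random


def apply_variants(reference, variants):
    """Return a copy of `reference` carrying every variant in `variants`.

    Each variant is a (pos, ref, alt) tuple with a 0-based position and
    VCF-style alleles (an indel keeps the base before the event).
    Variants are assumed not to overlap.
    """
    seq = reference
    # Apply from right to left so earlier coordinates are not shifted.
    for pos, ref, alt in sorted(variants, reverse=True):
        if seq[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq


def simulate_reads(reference, mutant, n_reads, read_length,
                   variant_fraction, seed=0):
    """Sample error-free reads; each read is drawn from `mutant` with
    probability `variant_fraction`, otherwise from `reference`."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        template = mutant if rng.random() < variant_fraction else reference
        start = rng.randrange(len(template) - read_length + 1)
        reads.append(template[start:start + read_length])
    return reads
```

The list passed to `apply_variants` doubles as the truth set: after running the pipeline on the simulated reads, the called variants can be compared against it. Changing `variant_fraction` gives a rough way to probe different variant frequencies, and the same tuples can describe INDELs of 10 bp or more.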

4.9 years ago by kspata ▴ 80