Validation of SNP and INDEL calling pipeline
4.9 years ago
kspata ▴ 80

Hi,

I am validating an in-house pipeline for calling SNPs and INDELs in small genomes. For this purpose I am using the GIAB NA12878 HiSeq 2500 300X coverage dataset: ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/

I have downloaded all FASTQ files from this folder and merged the forward reads from the two lanes into a single FASTQ file, and did the same for the reverse reads.
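The lane merge described above can be sketched roughly as follows. This is a minimal illustration, not the exact commands used: it assumes the files are gzipped and follow the usual Illumina naming with `_R1_` / `_R2_` in the file name, and the helper names `lane_files` and `merge_fastq` are made up for this example. Gzip files can be concatenated byte for byte and still form a valid gzip stream. The important part is that forward and reverse files are merged in the same lane order so that read pairs stay in sync.

```python
import shutil
from pathlib import Path


def lane_files(folder, read_tag):
    """Return the sorted gzipped FASTQ files for one read direction.

    read_tag is assumed to be '_R1_' (forward) or '_R2_' (reverse).
    Sorting keeps the lane order identical for both directions.
    """
    return sorted(Path(folder).glob(f"*{read_tag}*.fastq.gz"))


def merge_fastq(inputs, output):
    """Concatenate FASTQ files byte for byte into a single output file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


if __name__ == "__main__":
    folder = "Sample_U0a"  # local copy of the downloaded folder (assumed path)
    for tag, name in (("_R1_", "merged_R1.fastq.gz"), ("_R2_", "merged_R2.fastq.gz")):
        merge_fastq(lane_files(folder, tag), name)
```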

  1. Do I need to download and merge reads from the other folders as well, for example Sample_U0b, Sample_U0c and so on, together with Sample_U0a? Will the files from Sample_U0a alone give 300X coverage? I cannot find any explanation of whether to merge files from all samples or from just one.

The samples I usually work with are ssDNA viruses or dsDNA plasmids, 3000 to 12000 bp long.

  2. Can I map the reads from NA12878 to a specific gene (same length as above) and generate a subset of reads mapped to that region, which I can use for further analysis of SNPs and INDELs? Will this approach work for validation?

Previously I was using the PhiX dataset from Illumina to validate the pipeline. The problem is that it only has validated SNPs, not INDELs.

  3. Is there any plasmid/viral dataset other than PhiX that I can use for validation? It should contain both SNPs and INDELs (10 bp or longer) at different variant frequencies.

Thanks in advance!!

genome alignment SNP INDEL • 1.4k views
4.9 years ago

Can I map the reads from NA12878 to a specific gene (same length as above) and generate a subset of reads mapped to that region, which I can use for further analysis of SNPs and INDELs? Will this approach work for validation?

No, you should always align to the full genome. If you restrict the alignment to a single gene or region, reads that originate elsewhere in the genome are forced onto that region. This produces false-positive alignments, which in turn affect variant calling.

@wouterDeCoster,

Thank you for your input. However, we don't have the computing capacity to map whole-genome data to the entire human reference. I mapped the data from the link above to chr21, which gives a very low mean per-base coverage of 3, too low for downstream variant calling.
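For a rough sanity check of coverage figures like this, mean depth can be estimated as number of reads times read length divided by the length of the target. A minimal sketch, assuming every read maps and a fixed read length of 150 bp (an illustrative value, not taken from the dataset description):

```python
import math


def mean_coverage(n_reads, read_length, target_length):
    """Expected mean per-base depth if every read maps to the target."""
    return n_reads * read_length / target_length


def reads_needed(target_coverage, read_length, target_length):
    """Number of reads required to reach a given mean depth."""
    return math.ceil(target_coverage * target_length / read_length)


if __name__ == "__main__":
    # A 12000 bp plasmid at 300X with 150 bp reads (read length is assumed).
    print(reads_needed(300, 150, 12_000))  # 24000
```

This ignores unmapped reads, duplicates and uneven coverage, so the real depth will usually be lower than the estimate.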

Can I use a read simulator like dwgsim (https://github.com/nh13/DWGSIM) for SNP and INDEL pipeline validation?
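To make the idea concrete, here is a toy sketch of what simulation-based validation means: spike one known variant into a small reference at a chosen frequency, generate reads, and later check whether the pipeline recovers it. This is not dwgsim and does not use its options. It produces error-free single-end reads only, and all function names are hypothetical. Variants are given VCF-style as (0-based position, REF allele, ALT allele).

```python
import random


def apply_variant(ref, pos, ref_allele, alt_allele):
    """Return ref with ref_allele at 0-based pos replaced by alt_allele."""
    if ref[pos:pos + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference sequence")
    return ref[:pos] + alt_allele + ref[pos + len(ref_allele):]


def simulate_reads(ref, variant, freq, n_reads, read_len, seed=0):
    """Sample error-free reads from a mix of reference and mutant templates.

    Each read is drawn from the mutant sequence with probability `freq`,
    otherwise from the unmodified reference. Returns a list of
    (name, sequence, from_mutant) tuples; from_mutant is the truth label.
    """
    rng = random.Random(seed)
    pos, ref_allele, alt_allele = variant
    mutant = apply_variant(ref, pos, ref_allele, alt_allele)
    reads = []
    for i in range(n_reads):
        from_mutant = rng.random() < freq
        template = mutant if from_mutant else ref
        start = rng.randrange(len(template) - read_len + 1)
        reads.append((f"read{i}", template[start:start + read_len], from_mutant))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text with a constant placeholder quality."""
    return "".join(
        f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n" for name, seq, _ in reads
    )


if __name__ == "__main__":
    rng = random.Random(42)
    reference = "".join(rng.choice("ACGT") for _ in range(3000))
    # 10 bp deletion at position 1500, present in about 20% of reads.
    deletion = (1500, reference[1500:1511], reference[1500])
    reads = simulate_reads(reference, deletion, freq=0.2, n_reads=6000, read_len=150)
    with open("simulated.fastq", "w") as handle:
        handle.write(to_fastq(reads))
```

Because the spiked variant and its frequency are known, the called variants can be compared against this truth directly. A real simulator additionally models sequencing errors, paired-end reads and quality scores, which this sketch leaves out.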
